EXHIBIT A 



This part presents several algorithms that solve the following sorting
problem:

Input: A sequence of n numbers ⟨a_1, a_2, ..., a_n⟩.

Output: A permutation (reordering) ⟨a'_1, a'_2, ..., a'_n⟩ of the input
sequence such that a'_1 ≤ a'_2 ≤ ... ≤ a'_n.

The input sequence is usually an n-element array, although it may be
represented in some other fashion, such as a linked list.

The structure of the data 

In practice, the numbers to be sorted are rarely isolated values. Each is 
usually part of a collection of data called a record. Each record contains 
a key, which is the value to be sorted, and the remainder of the record 
consists of satellite data, which are usually carried around with the key. In 
practice, when a sorting algorithm permutes the keys, it must permute the 
satellite data as well. If each record includes a large amount of satellite 
data, we often permute an array of pointers to the records rather than the 
records themselves in order to minimize data movement. 
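
A minimal Python sketch of this idea follows. It assumes hypothetical records stored as (key, satellite data) tuples, and it relies on Python's built-in sorted rather than any algorithm from this part. The names records and sort_indices_by_key are illustrative only. Instead of moving the records, we permute a list of indices that refer to them.

```python
def sort_indices_by_key(records):
    """Return the indices of `records` ordered by each record's key.

    Each record is assumed to be a (key, satellite_data) tuple. Only the
    small integer indices are permuted; the records themselves stay put.
    """
    return sorted(range(len(records)), key=lambda idx: records[idx][0])


# Hypothetical records: the key is the first field, the rest is satellite data.
records = [(42, "large payload A"), (7, "large payload B"), (19, "large payload C")]
order = sort_indices_by_key(records)
sorted_view = [records[idx] for idx in order]  # records visited in key order
```

The list order plays the role of the array of pointers: reading the records through it visits them in sorted order, while records itself is left unchanged.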

In a sense, it is these implementation details that distinguish an algo- 
rithm from a full-blown program. Whether we sort individual numbers or 
large records that contain numbers is irrelevant to the method by which 
a sorting procedure determines the sorted order. Thus, when focusing on 
the problem of sorting, we typically assume that the input consists only 
of numbers. The translation of an algorithm for sorting numbers into a 
program for sorting records is conceptually straightforward, although in 
a given engineering situation there may be other subtleties that make the 
actual programming task a challenge.

Sorting algorithms

We introduced two algorithms that sort n real numbers in Chapter 1.
Insertion sort takes Θ(n^2) time in the worst case. Because its inner loops
are tight, however, it is a fast in-place sorting algorithm for small input
sizes. (Recall that a sorting algorithm sorts in place if only a constant
number of elements of the input array are ever stored outside the array.)
Merge sort has a better asymptotic running time, O(n lg n), but the Merge
procedure it uses does not operate in place.

In this part, we shall introduce two more algorithms that sort arbitrary
real numbers. Heapsort, presented in Chapter 7, sorts n numbers in place
in O(n lg n) time. It uses an important data structure, called a heap, to
implement a priority queue.

Quicksort, in Chapter 8, also sorts n numbers in place, but its worst-case
running time is Θ(n^2). Its average-case running time is Θ(n lg n),
though, and it generally outperforms heapsort in practice. Like insertion
sort, quicksort has tight code, so the hidden constant factor in its running
time is small. It is a popular algorithm for sorting large input arrays.

Insertion sort, merge sort, heapsort, and quicksort are all comparison
sorts: they determine the sorted order of an input array by comparing
elements. Chapter 9 begins by introducing the decision-tree model in order
to study the performance limitations of comparison sorts. Using this model,
we prove a lower bound of Ω(n lg n) on the worst-case running time of any
comparison sort on n inputs, thus showing that heapsort and merge sort
are asymptotically optimal comparison sorts.

Chapter 9 then goes on to show that we can beat this lower bound of
Ω(n lg n) if we can gather information about the sorted order of the input
by means other than comparing elements. The counting sort algorithm,
for example, assumes that the input numbers are in the set {1, 2, ..., k}.
By using array indexing as a tool for determining relative order, counting
sort can sort n numbers in O(k + n) time. Thus, when k = O(n), counting
sort runs in time that is linear in the size of the input array. A related
algorithm, radix sort, can be used to extend the range of counting sort.
If there are n integers to sort, each integer has d digits, and each digit
is in the set {1, 2, ..., k}, radix sort can sort the numbers in O(d(n + k))
time. When d is a constant and k is O(n), radix sort runs in linear time.
A third algorithm, bucket sort, requires knowledge of the probabilistic
distribution of numbers in the input array. It can sort n real numbers
uniformly distributed in the half-open interval [0, 1) in average-case O(n)
time.
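
To make the idea of sorting by array indexing concrete, here is a hedged Python sketch of counting sort for inputs drawn from {1, 2, ..., k}. It is one common formulation and not necessarily the version presented in Chapter 9. The function name counting_sort and its parameters are our own.

```python
def counting_sort(values, k):
    """Sort integers drawn from {1, 2, ..., k} without comparing elements."""
    counts = [0] * (k + 1)              # counts[v] = number of occurrences of v
    for v in values:
        counts[v] += 1
    # Prefix sums: counts[v] becomes the number of elements <= v.
    for v in range(1, k + 1):
        counts[v] += counts[v - 1]
    output = [None] * len(values)
    # Scan right to left so that equal keys keep their relative order.
    for v in reversed(values):
        counts[v] -= 1
        output[counts[v]] = v
    return output
```

The two passes over values and the one pass over counts account for the O(k + n) running time quoted above.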

Order statistics 

The ith order statistic of a set of n numbers is the ith smallest number
in the set. One can, of course, select the ith order statistic by sorting the
input and indexing the ith element of the output. With no assumptions
about the input distribution, this method runs in Ω(n lg n) time, as the
lower bound proved in Chapter 9 shows.

In Chapter 10, we show that we can find the ith smallest element in O(n)
time, even when the elements are arbitrary real numbers. We present an
algorithm with tight pseudocode that runs in O(n^2) time in the worst case,
but linear time on average. We also give a more complicated algorithm
that runs in O(n) worst-case time.
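
As a baseline for comparison, the sorting-based selection method mentioned at the start of this subsection can be sketched in Python as follows. This is only the simple method, not either of the Chapter 10 algorithms. The function name is our own, and the sort used is Python's built-in sorted.

```python
def order_statistic_by_sorting(values, i):
    """Return the ith smallest element (1-indexed) of `values` by sorting."""
    if not 1 <= i <= len(values):
        raise ValueError("i must be between 1 and len(values)")
    return sorted(values)[i - 1]
```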

Background 

Although most of this part does not rely on difficult mathematics, some 
sections do require mathematical sophistication. In particular, the average- 
case analyses of quicksort, bucket sort, and the order-statistic algorithm 
use probability, which is reviewed in Chapter 6. The analysis of the worst- 
case linear-time algorithm for the order statistic involves somewhat more 
sophisticated mathematics than the other worst-case analyses in this part. 



7 Heapsort 



In this chapter, we introduce another sorting algorithm. Like merge sort,
but unlike insertion sort, heapsort's running time is O(n lg n). Like
insertion sort, but unlike merge sort, heapsort sorts in place: only a constant
number of array elements are stored outside the input array at any time.
Thus, heapsort combines the better attributes of the two sorting algorithms 
we have already discussed. 

Heapsort also introduces another algorithm design technique: the use of 
a data structure, in this case one we call a "heap," to manage information 
during the execution of the algorithm. Not only is the heap data structure 
useful for heapsort, it also makes an efficient priority queue. The heap 
data structure will reappear in algorithms in later chapters. 

We note that the term "heap" was originally coined in the context of 
heapsort, but it has since come to refer to "garbage-collected storage," 
such as the programming language Lisp provides. Our heap data structure 
is not garbage-collected storage, and whenever we refer to heaps in this 
book, we shall mean the structure defined in this chapter. 



7.1 Heaps 

The (binary) heap data structure is an array object that can be viewed as
a complete binary tree (see Section 5.5.3), as shown in Figure 7.1. Each
node of the tree corresponds to an element of the array that stores the value
in the node. The tree is completely filled on all levels except possibly the
lowest, which is filled from the left up to a point. An array A that represents
a heap is an object with two attributes: length[A], which is the number of
elements in the array, and heap-size[A], the number of elements in the heap
stored within array A. That is, although A[1 . . length[A]] may contain valid
numbers, no element past A[heap-size[A]], where heap-size[A] ≤ length[A],
is an element of the heap. The root of the tree is A[1], and given the index
i of a node, the indices of its parent Parent(i), left child Left(i), and
right child Right(i) can be computed simply:



Parent(i)
   return ⌊i/2⌋

Left(i)
   return 2i

Right(i)
   return 2i + 1

Figure 7.1 A heap viewed as (a) a binary tree and (b) an array. The number
within the circle at each node in the tree is the value stored at that node. The
number next to a node is the corresponding index in the array.

On most computers, the Left procedure can compute 2i in one instruction
by simply shifting the binary representation of i left one bit position.
Similarly, the Right procedure can quickly compute 2i + 1 by shifting
the binary representation of i left one bit position and shifting in a 1 as
the low-order bit. The Parent procedure can compute ⌊i/2⌋ by shifting i
right one bit position. In a good implementation of heapsort, these three
procedures are often implemented as "macros" or "in-line" procedures.

Heaps also satisfy the heap property: for every node i other than the
root,

\[ A[\mathrm{Parent}(i)] \ge A[i] , \tag{7.1} \]

that is, the value of a node is at most the value of its parent. Thus, the
largest element in a heap is stored at the root, and the subtrees rooted at
a node contain smaller values than does the node itself.

We define the height of a node in a tree to be the number of edges
on the longest simple downward path from the node to a leaf, and we
define the height of the tree to be the height of its root. Since a heap of
n elements is based on a complete binary tree, its height is Θ(lg n) (see
Exercise 7.1-2). We shall see that the basic operations on heaps run in
time at most proportional to the height of the tree and thus take O(lg n)
time. The remainder of this chapter presents five basic procedures and
shows how they are used in a sorting algorithm and a priority-queue data
structure.

• The Heapify procedure, which runs in O(lg n) time, is the key to
maintaining the heap property (7.1).

• The Build-Heap procedure, which runs in linear time, produces a heap
from an unordered input array.

• The Heapsort procedure, which runs in O(n lg n) time, sorts an array
in place.

• The Extract-Max and Insert procedures, which run in O(lg n) time,
allow the heap data structure to be used as a priority queue.
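
Before turning to these procedures, the index arithmetic of Parent, Left, and Right and the heap property (7.1) described earlier in this section can be sketched in Python. This is an illustrative sketch under our own conventions: the list keeps an unused placeholder in slot 0 so that indices are 1-based as in the text, and the helper is_heap and the sample array are not from the text.

```python
def parent(i):
    """Index of the parent of node i (1-based): floor(i / 2) via a right shift."""
    return i >> 1


def left(i):
    """Index of the left child of node i: 2i via a left shift."""
    return i << 1


def right(i):
    """Index of the right child of node i: 2i + 1, shifting in a low-order 1 bit."""
    return (i << 1) | 1


def is_heap(a, heap_size):
    """Check heap property (7.1) on a[1..heap_size]; a[0] is an unused placeholder."""
    return all(a[parent(i)] >= a[i] for i in range(2, heap_size + 1))


# Hypothetical 10-element example; slot 0 is padding for 1-based indexing.
sample = [None, 16, 14, 10, 8, 7, 9, 3, 2, 4, 1]
sample_is_heap = is_heap(sample, 10)
```

Here is_heap compares every non-root node with its parent, which is exactly the condition stated in (7.1).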

Exercises 

7.1-1

What are the minimum and maximum numbers of elements in a heap of
height h?

7.1-2

Show that an n-element heap has height ⌊lg n⌋.

7.1-3

Show that the largest element in a subtree of a heap is at the root of the
subtree.

7.1-4

Where in a heap might the smallest element reside?

7.1-5

Is an array that is in reverse sorted order a heap?

7.1-6

Is the sequence ⟨23, 17, 14, 6, 13, 10, 1, 5, 7, 12⟩ a heap?



7.2 Maintaining the heap property 

Heapify is an important subroutine for manipulating heaps. Its inputs
are an array A and an index i into the array. When Heapify is called, it is
assumed that the binary trees rooted at Left(i) and Right(i) are heaps,
but that A[i] may be smaller than its children, thus violating the heap
property (7.1). The function of Heapify is to let the value at A[i] "float
down" in the heap so that the subtree rooted at index i becomes a heap.

Figure 7.2 The action of Heapify(A, 2), where heap-size[A] = 10. (a) The initial
configuration of the heap, with A[2] at node i = 2 violating the heap property
since it is not larger than both children. The heap property is restored for node 2
in (b) by exchanging A[2] with A[4], which destroys the heap property for node 4.
The recursive call Heapify(A, 4) now sets i = 4. After swapping A[4] with A[9],
as shown in (c), node 4 is fixed up, and the recursive call Heapify(A, 9) yields no
further change to the data structure.



Heapify(A, i)
 1  l ← Left(i)
 2  r ← Right(i)
 3  if l ≤ heap-size[A] and A[l] > A[i]
 4     then largest ← l
 5     else largest ← i
 6  if r ≤ heap-size[A] and A[r] > A[largest]
 7     then largest ← r
 8  if largest ≠ i
 9     then exchange A[i] ↔ A[largest]
10          Heapify(A, largest)

Figure 7.2 illustrates the action of Heapify. At each step, the largest
of the elements A[i], A[Left(i)], and A[Right(i)] is determined, and its
index is stored in largest. If A[i] is largest, then the subtree rooted at node i
is a heap and the procedure terminates. Otherwise, one of the two children
has the largest element, and A[i] is swapped with A[largest], which causes
node i and its children to satisfy the heap property. The node largest,
however, now has the original value A[i], and thus the subtree rooted at
largest may violate the heap property. Consequently, Heapify must be
called recursively on that subtree.
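
The Heapify pseudocode can be transcribed into Python roughly as follows. This is a sketch under two assumptions of ours: the list keeps an unused placeholder in slot 0 so that indices are 1-based, and heap_size is passed as an explicit argument because a Python list has no heap-size attribute. The sample array is hypothetical, chosen so that the exchanges match those described in the Figure 7.2 caption.

```python
def heapify(a, i, heap_size):
    """Let a[i] float down so that the subtree rooted at i becomes a heap.

    Assumes 1-based indexing (a[0] is an unused placeholder) and that the
    subtrees rooted at the children of i already satisfy the heap property.
    """
    l, r = 2 * i, 2 * i + 1
    largest = l if l <= heap_size and a[l] > a[i] else i
    if r <= heap_size and a[r] > a[largest]:
        largest = r
    if largest != i:
        a[i], a[largest] = a[largest], a[i]
        heapify(a, largest, heap_size)


# Hypothetical example: node 2 violates the heap property.
a = [None, 16, 4, 10, 14, 7, 9, 3, 2, 8, 1]
heapify(a, 2, 10)   # exchanges a[2] with a[4], then a[4] with a[9]
```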

The running time of Heapify on a subtree of size n rooted at given
node i is the Θ(1) time to fix up the relationships among the elements
A[i], A[Left(i)], and A[Right(i)], plus the time to run Heapify on a
subtree rooted at one of the children of node i. The children's subtrees
each have size at most 2n/3 (the worst case occurs when the last row of
the tree is exactly half full), and the running time of Heapify can therefore
be described by the recurrence

\[ T(n) \le T(2n/3) + \Theta(1) . \]

The solution to this recurrence, by case 2 of the master theorem
(Theorem 4.1), is T(n) = O(lg n). Alternatively, we can characterize the
running time of Heapify on a node of height h as O(h).



Exercises 



7.2-1

Using Figure 7.2 as a model, illustrate the operation of Heapify(A, 3) on
the array A = ⟨27, 17, 3, 16, 13, 10, 1, 5, 7, 12, 4, 8, 9, 0⟩.

7.2-2

What is the effect of calling Heapify(A, i) when the element A[i] is larger
than its children?

7.2-3

What is the effect of calling Heapify(A, i) for i > heap-size[A]/2?

7.2-4

The code for Heapify is quite efficient in terms of constant factors, except
possibly for the recursive call in line 10, which might cause some
compilers to produce inefficient code. Write an efficient Heapify that uses an
iterative control construct (a loop) instead of recursion.

7.2-5

Show that the worst-case running time of Heapify on a heap of size n
is Ω(lg n). (Hint: For a heap with n nodes, give node values that cause
Heapify to be called recursively at every node on a path from the root
down to a leaf.)



7.3 Building a heap

We can use the procedure Heapify in a bottom-up manner to convert
an array A[1 . . n], where n = length[A], into a heap. Since the elements
in the subarray A[(⌊n/2⌋ + 1) . . n] are all leaves of the tree, each is a
1-element heap to begin with. The procedure Build-Heap goes through the
remaining nodes of the tree and runs Heapify on each one. The order
in which the nodes are processed guarantees that the subtrees rooted at
children of a node i are heaps before Heapify is run at that node.

Build-Heap(A)
1  heap-size[A] ← length[A]
2  for i ← ⌊length[A]/2⌋ downto 1
3       do Heapify(A, i)

Figure 7.3 shows an example of the action of Build-Heap. 
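
A Python sketch of Build-Heap, under the same assumptions as before (our own 1-based list with an unused slot 0, and an explicit heap_size argument), might look like this. The input array is a hypothetical example, and the helper heapify is repeated so that the block is self-contained.

```python
def heapify(a, i, heap_size):
    """Float a[i] down until the subtree rooted at i is a heap (1-based)."""
    l, r = 2 * i, 2 * i + 1
    largest = l if l <= heap_size and a[l] > a[i] else i
    if r <= heap_size and a[r] > a[largest]:
        largest = r
    if largest != i:
        a[i], a[largest] = a[largest], a[i]
        heapify(a, largest, heap_size)


def build_heap(a):
    """Rearrange a[1..n] into a heap in place and return the heap size n."""
    n = len(a) - 1                      # a[0] is an unused placeholder
    for i in range(n // 2, 0, -1):      # nodes floor(n/2) down to 1
        heapify(a, i, n)
    return n


# Hypothetical unordered input with 10 elements.
a = [None, 4, 1, 3, 2, 16, 9, 10, 14, 8, 7]
heap_size = build_heap(a)
```

The loop starts at ⌊n/2⌋ because every node with a larger index is a leaf and is already a 1-element heap.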

We can compute a simple upper bound on the running time of Build-Heap
as follows. Each call to Heapify costs O(lg n) time, and there are
O(n) such calls. Thus, the running time is at most O(n lg n). This upper
bound, though correct, is not asymptotically tight.

We can derive a tighter bound by observing that the time for Heapify
to run at a node varies with the height of the node in the tree, and the
heights of most nodes are small. Our tighter analysis relies on the property
that in an n-element heap there are at most ⌈n/2^(h+1)⌉ nodes of height h
(see Exercise 7.3-3).

The time required by Heapify when called on a node of height h is
O(h), so we can express the total cost of Build-Heap as

\[ \sum_{h=0}^{\lfloor \lg n \rfloor} \left\lceil \frac{n}{2^{h+1}} \right\rceil O(h)
   = O\!\left( n \sum_{h=0}^{\lfloor \lg n \rfloor} \frac{h}{2^h} \right) . \]

The last summation can be evaluated by substituting x = 1/2 in the
formula (3.6), which yields

\[ \sum_{h=0}^{\infty} \frac{h}{2^h} = \frac{1/2}{(1 - 1/2)^2} = 2 . \tag{7.2} \]

Thus, the running time of Build-Heap can be bounded as

\[ O\!\left( n \sum_{h=0}^{\lfloor \lg n \rfloor} \frac{h}{2^h} \right)
   = O\!\left( n \sum_{h=0}^{\infty} \frac{h}{2^h} \right)
   = O(n) . \]



Hence, we can build a heap from an unordered array in linear time.

Figure 7.3 The operation of Build-Heap, showing the data structure before the
call to Heapify in line 3 of Build-Heap. (a) A 10-element input array A and the
binary tree it represents. The figure shows that the loop index i points to node 5
before the call Heapify(A, i). (b) The data structure that results. The loop index i
for the next iteration points to node 4. (c)-(e) Subsequent iterations of the for
loop in Build-Heap. Observe that whenever Heapify is called on a node, the two
subtrees of that node are both heaps. (f) The heap after Build-Heap finishes.

Exercises

7.3-1

Using Figure 7.3 as a model, illustrate the operation of Build-Heap on
the array A = ⟨5, 3, 17, 10, 84, 19, 6, 22, 9⟩.

7.3-2

Why do we want the loop index i in line 2 of Build-Heap to decrease
from ⌊length[A]/2⌋ to 1 rather than increase from 1 to ⌊length[A]/2⌋?

7.3-3

Show that there are at most ⌈n/2^(h+1)⌉ nodes of height h in any n-element
heap.



7.4 The heapsort algorithm 

The heapsort algorithm starts by using Build-Heap to build a heap on the
input array A[1 . . n], where n = length[A]. Since the maximum element
of the array is stored at the root A[1], it can be put into its correct final
position by exchanging it with A[n]. If we now "discard" node n from the
heap (by decrementing heap-size[A]), we observe that A[1 . . (n - 1)] can
easily be made into a heap. The children of the root remain heaps, but the
new root element may violate the heap property (7.1). All that is needed
to restore the heap property, however, is one call to Heapify(A, 1), which
leaves a heap in A[1 . . (n - 1)]. The heapsort algorithm then repeats this
process for the heap of size n - 1 down to a heap of size 2.

Heapsort(A)
1  Build-Heap(A)
2  for i ← length[A] downto 2
3       do exchange A[1] ↔ A[i]
4          heap-size[A] ← heap-size[A] - 1
5          Heapify(A, 1)

Figure 7.4 shows an example of the operation of heapsort after the heap 
is initially built. Each heap is shown at the beginning of an iteration of 
the for loop in line 2. 
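
The Heapsort procedure above can be sketched in Python as follows. As in the earlier sketches, the 1-based list with an unused slot 0 and the explicit heap_size variable are conventions of ours, not of the text. The Build-Heap loop is written inline and heapify is repeated so that the block stands alone.

```python
def heapify(a, i, heap_size):
    """Float a[i] down until the subtree rooted at i is a heap (1-based)."""
    l, r = 2 * i, 2 * i + 1
    largest = l if l <= heap_size and a[l] > a[i] else i
    if r <= heap_size and a[r] > a[largest]:
        largest = r
    if largest != i:
        a[i], a[largest] = a[largest], a[i]
        heapify(a, largest, heap_size)


def heapsort(a):
    """Sort a[1..n] in place in nondecreasing order; a[0] is a placeholder."""
    n = len(a) - 1
    for i in range(n // 2, 0, -1):      # Build-Heap
        heapify(a, i, n)
    heap_size = n
    for i in range(n, 1, -1):           # i = n downto 2
        a[1], a[i] = a[i], a[1]         # move the current maximum to slot i
        heap_size -= 1                  # "discard" node i from the heap
        heapify(a, 1, heap_size)


# Hypothetical input.
data = [None, 3, 9, 1, 7, 3, 8]
heapsort(data)
```

Only a constant number of elements are held outside the list at any time, which is what sorting in place means here.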

The Heapsort procedure takes time O(n lg n), since the call to Build-Heap
takes time O(n) and each of the n - 1 calls to Heapify takes time
O(lg n).

Figure 7.4 The operation of Heapsort. (a) The heap data structure just after it
has been built by Build-Heap. (b)-(j) The heap just after each call of Heapify in
line 5. The value of i at that time is shown. Only lightly shaded nodes remain in
the heap. (k) The resulting sorted array A.

Exercises

7.4-1

Using Figure 7.4 as a model, illustrate the operation of Heapsort on the
array A = ⟨5, 13, 2, 25, 7, 17, 20, 8, 4⟩.

7.4-2 

What is the running time of heapsort on an array A of length n that is 
already sorted in increasing order? What about decreasing order? 

7.4-3 

Show that the running time of heapsort is Ω(n lg n).



7.5 Priority queues 

Heapsort is an excellent algorithm, but a good implementation of quick- 
sort, presented in Chapter 8, usually beats it in practice. Nevertheless, the 
heap data structure itself has enormous utility. In this section, we present 
one of the most popular applications of a heap: its use as an efficient 
priority queue. 

A priority queue is a data structure for maintaining a set S of elements, 
each with an associated value called a key. A priority queue supports the 
following operations. 

Insert(S, x) inserts the element x into the set S. This operation could be
written as S ← S ∪ {x}.

Maximum(S) returns the element of S with the largest key.

Extract-Max(S) removes and returns the element of S with the largest key.

One application of priority queues is to schedule jobs on a shared com- 
puter. The priority queue keeps track of the jobs to be performed and 
their relative priorities. When a job is finished or interrupted, the highest- 
priority job is selected from those pending using Extract-Max. A new 
job can be added to the queue at any time using Insert. 
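
As a rough illustration of the job-scheduling application, the sketch below builds a max-priority queue on top of Python's standard heapq module. Because heapq provides a min-heap, priorities are negated on insertion. The class JobQueue, its method names, and the sample jobs are hypothetical and are not part of the text.

```python
import heapq


class JobQueue:
    """Max-priority queue of jobs built on heapq, which is a min-heap."""

    def __init__(self):
        self._heap = []

    def insert(self, priority, job):
        # Negate the priority so the largest priority is popped first.
        heapq.heappush(self._heap, (-priority, job))

    def extract_max(self):
        if not self._heap:
            raise IndexError("heap underflow")
        neg_priority, job = heapq.heappop(self._heap)
        return -neg_priority, job


# Hypothetical jobs with their relative priorities.
queue = JobQueue()
queue.insert(2, "backup")
queue.insert(9, "compile")
queue.insert(5, "email")
first = queue.extract_max()   # the highest-priority pending job
```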

A priority queue can also be used in an event-driven simulator. The
items in the queue are events to be simulated, each with an associated
time of occurrence that serves as its key. The events must be simulated
in order of their time of occurrence, because the simulation of an event
can cause other events to be simulated in the future. For this application,
it is natural to reverse the linear order of the priority queue and support
the operations Minimum and Extract-Min instead of Maximum and
Extract-Max. The simulation program uses Extract-Min at each step
to choose the next event to simulate. As new events are produced, they
are inserted into the priority queue using Insert.

Not surprisingly, we can use a heap to implement a priority queue. The
operation Heap-Maximum returns the maximum heap element in Θ(1)
time by simply returning the value A[1] in the heap. The Heap-Extract-Max
procedure is similar to the for loop body (lines 3-5) of the Heapsort
procedure:

Heap-Extract-Max(A)
1  if heap-size[A] < 1
2     then error "heap underflow"
3  max ← A[1]
4  A[1] ← A[heap-size[A]]
5  heap-size[A] ← heap-size[A] - 1
6  Heapify(A, 1)
7  return max

The running time of Heap-Extract-Max is O(lg n), since it performs
only a constant amount of work on top of the O(lg n) time for Heapify.

The Heap-Insert procedure inserts a node into heap A. To do so,
it first expands the heap by adding a new leaf to the tree. Then, in a
manner reminiscent of the insertion loop (lines 5-7) of Insertion-Sort
from Section 1.1, it traverses a path from this leaf toward the root to find
a proper place for the new element.

Heap-Insert(A, key)
1  heap-size[A] ← heap-size[A] + 1
2  i ← heap-size[A]
3  while i > 1 and A[Parent(i)] < key
4       do A[i] ← A[Parent(i)]
5          i ← Parent(i)
6  A[i] ← key

Figure 7.5 shows an example of a Heap-Insert operation. The running
time of Heap-Insert on an n-element heap is O(lg n), since the path traced
from the new leaf to the root has length O(lg n).

In summary, a heap can support any priority-queue operation on a set
of size n in O(lg n) time.
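
The three heap-based priority-queue operations can be gathered into one small Python class, sketched below. The class name MaxHeap, the growable list with an unused slot 0, and the use of IndexError for "heap underflow" are our own choices. The method bodies follow the Heap-Maximum, Heap-Extract-Max, and Heap-Insert procedures of this section.

```python
class MaxHeap:
    """Array-backed max-priority queue; slot 0 is unused so indices are 1-based."""

    def __init__(self):
        self.a = [None]
        self.heap_size = 0

    def maximum(self):
        if self.heap_size < 1:
            raise IndexError("heap underflow")
        return self.a[1]

    def _heapify(self, i):
        l, r = 2 * i, 2 * i + 1
        largest = l if l <= self.heap_size and self.a[l] > self.a[i] else i
        if r <= self.heap_size and self.a[r] > self.a[largest]:
            largest = r
        if largest != i:
            self.a[i], self.a[largest] = self.a[largest], self.a[i]
            self._heapify(largest)

    def extract_max(self):
        if self.heap_size < 1:
            raise IndexError("heap underflow")
        maximum = self.a[1]
        self.a[1] = self.a[self.heap_size]   # move the last leaf to the root
        self.a.pop()                         # drop the now-duplicated last slot
        self.heap_size -= 1
        if self.heap_size >= 1:
            self._heapify(1)
        return maximum

    def insert(self, key):
        self.heap_size += 1
        self.a.append(key)                   # new leaf; overwritten below if needed
        i = self.heap_size
        while i > 1 and self.a[i // 2] < key:
            self.a[i] = self.a[i // 2]       # copy the parent down
            i //= 2
        self.a[i] = key


# Hypothetical usage.
pq = MaxHeap()
for key in [5, 13, 2, 25, 7]:
    pq.insert(key)
largest = pq.extract_max()
```

Note that insert copies parents down and writes key once at the end, rather than swapping at every level, mirroring lines 3-6 of Heap-Insert.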

Exercises 

7.5-1 

Using Figure 7.5 as a model, illustrate the operation of Heap-Insert(A, 3)
on the heap A = ⟨15, 13, 9, 5, 12, 8, 7, 4, 0, 6, 2, 1⟩.

7.5-2

Illustrate the operation of Heap-Extract-Max on the heap A = ⟨15, 13,
9, 5, 12, 8, 7, 4, 0, 6, 2, 1⟩.

Figure 7.5 The operation of Heap-Insert. (a) The heap of Figure 7.4(a) before
we insert a node with key 15. (b) A new leaf is added to the tree. (c) Values on
the path from the new leaf to the root are copied down until a place for the key 15
is found. (d) The key 15 is inserted.

7.5-3

Show how to implement a first-in, first-out queue with a priority queue.
Show how to implement a stack with a priority queue. (FIFO's and stacks
are defined in Section 11.1.)

7.5-4 

Give an O(lg n)-time implementation of the procedure
Heap-Increase-Key(A, i, k), which sets A[i] ← max(A[i], k) and updates the
heap structure appropriately.

7.5-5

The operation Heap-Delete(A, i) deletes the item in node i from heap A.
Give an implementation of Heap-Delete that runs in O(lg n) time for an
n-element heap.

7.5-6 

Give an O(n lg k)-time algorithm to merge k sorted lists into one sorted
list, where n is the total number of elements in all the input lists. (Hint:
Use a heap for k-way merging.)



Problems

7-1 Building a heap using insertion 

The procedure Build-Heap in Section 7.3 can be implemented by repeatedly
using Heap-Insert to insert the elements into the heap. Consider the
following implementation:

Build-Heap'(A)
1  heap-size[A] ← 1
2  for i ← 2 to length[A]
3       do Heap-Insert(A, A[i])

a. Do the procedures Build-Heap and Build-Heap' always create the
same heap when run on the same input array? Prove that they do,
or provide a counterexample.

b. Show that in the worst case, Build-Heap' requires Θ(n lg n) time to
build an n-element heap.

7-2 Analysis of d-ary heaps 

A d-ary heap is like a binary heap, but instead of 2 children, nodes have 
d children. 

a. How would you represent a d-ary heap in an array?

b. What is the height of a d-ary heap of n elements in terms of n and d?

c. Give an efficient implementation of Extract-Max. Analyze its
running time in terms of d and n.

d. Give an efficient implementation of Insert. Analyze its running time
in terms of d and n.

e. Give an efficient implementation of Heap-Increase-Key(A, i, k), which
sets A[i] ← max(A[i], k) and updates the heap structure appropriately.
Analyze its running time in terms of d and n.



Chapter notes 

The heapsort algorithm was invented by Williams [202], who also de- 
scribed how to implement a priority queue with a heap. The Build-Heap 
procedure was suggested by Floyd [69].



8 Quicksort



Quicksort is a sorting algorithm whose worst-case running time is Θ(n^2)
on an input array of n numbers. In spite of this slow worst-case running
time, quicksort is often the best practical choice for sorting because it is
remarkably efficient on the average: its expected running time is Θ(n lg n),
and the constant factors hidden in the Θ(n lg n) notation are quite small.
It also has the advantage of sorting in place (see page 3), and it works well
even in virtual memory environments.

Section 8.1 describes the algorithm and an important subroutine used by
quicksort for partitioning. Because the behavior of quicksort is complex,
we start with an intuitive discussion of its performance in Section 8.2 and
postpone its precise analysis to the end of the chapter. Section 8.3 presents
two versions of quicksort that use a random-number generator. These
"randomized" algorithms have many desirable properties. Their
average-case running time is good, and no particular input elicits their
worst-case behavior. One of the randomized versions of quicksort is analyzed in
Section 8.4, where it is shown to run in O(n^2) time in the worst case and
in O(n lg n) time on average.



8.1 Description of quicksort 

Quicksort, like merge sort, is based on the divide-and-conquer paradigm 
introduced in Section 1.3.1. Here is the three-step divide-and-conquer 
process for sorting a typical subarray A[p . . r]. 

Divide: The array A[p . . r] is partitioned (rearranged) into two nonempty
subarrays A[p . . q] and A[q + 1 . . r] such that each element of A[p . . q]
is less than or equal to each element of A[q + 1 . . r]. The index q is
computed as part of this partitioning procedure.

Conquer: The two subarrays A[p . . q] and A[q + 1 . . r] are sorted by
recursive calls to quicksort.

Combine: Since the subarrays are sorted in place, no work is needed to 
combine them: the entire array A[p . . r] is now sorted. 

The following procedure implements quicksort.

Quicksort(A, p, r)
1  if p < r
2     then q ← Partition(A, p, r)
3          Quicksort(A, p, q)
4          Quicksort(A, q + 1, r)

To sort an entire array A, the initial call is Quicksort(A, 1, length[A]).

Partitioning the array

The key to the algorithm is the Partition procedure, which rearranges 
the subarray A[p . . r] in place. 

Partition(A, p, r)
 1  x ← A[p]
 2  i ← p - 1
 3  j ← r + 1
 4  while true
 5      do repeat j ← j - 1
 6           until A[j] ≤ x
 7         repeat i ← i + 1
 8           until A[i] ≥ x
 9         if i < j
10           then exchange A[i] ↔ A[j]
11           else return j



Figure 8.1 shows how Partition works. It first selects an element x = A[p]
from A[p . . r] as a "pivot" element around which to partition A[p . . r]. It
then grows two regions A[p . . i] and A[j . . r] from the top and bottom of
A[p . . r], respectively, such that every element in A[p . . i] is less than or
equal to x and every element in A[j . . r] is greater than or equal to x.
Initially, i = p - 1 and j = r + 1, so the two regions are empty.

Within the body of the while loop, the index j is decremented and the
index i is incremented, in lines 5-8, until A[i] ≥ x ≥ A[j]. Assuming that
these inequalities are strict, A[i] is too large to belong to the bottom region
and A[j] is too small to belong to the top region. Thus, by exchanging A[i]
and A[j] as is done in line 10, we can extend the two regions. (If the
inequalities are not strict, the exchange can be performed anyway.)

The body of the while loop repeats until i ≥ j, at which point the entire
array A[p . . r] has been partitioned into two subarrays A[p . . q] and
A[q + 1 . . r], where p ≤ q < r, such that no element of A[p . . q] is larger
than any element of A[q + 1 . . r]. The value q = j is returned at the end of
the procedure.
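
The Quicksort and Partition procedures can be sketched in Python as follows. This sketch assumes ordinary 0-based Python lists, with p and r as inclusive bounds passed by the caller, so the initial call becomes quicksort(a, 0, len(a) - 1). The repeat-until loops are rewritten as a decrement or increment followed by a while loop, which is our own transcription. The sample array is hypothetical, with pivot 5.

```python
def partition(a, p, r):
    """Partition a[p..r] (inclusive) around the pivot x = a[p].

    Returns j with p <= j < r such that no element of a[p..j] is larger
    than any element of a[j+1..r].
    """
    x = a[p]
    i, j = p - 1, r + 1
    while True:
        j -= 1
        while a[j] > x:      # repeat j <- j - 1 until a[j] <= x
            j -= 1
        i += 1
        while a[i] < x:      # repeat i <- i + 1 until a[i] >= x
            i += 1
        if i < j:
            a[i], a[j] = a[j], a[i]
        else:
            return j


def quicksort(a, p, r):
    """Sort a[p..r] in place."""
    if p < r:
        q = partition(a, p, r)
        quicksort(a, p, q)
        quicksort(a, q + 1, r)


# Hypothetical sample array.
sample = [5, 3, 2, 6, 4, 1, 3, 7]
q = partition(sample, 0, len(sample) - 1)
after_partition = list(sample)
quicksort(sample, 0, len(sample) - 1)
```

Note that the recursive calls are on a[p..q] and a[q+1..r], so the element at index q stays in the left subarray. This differs from partitioning schemes that place the pivot in its final position and exclude it from both recursive calls.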

Conceptually, the partitioning procedure performs a simple function:
it puts elements smaller than x into the bottom region of the array and
elements larger than x into the top region. There are technicalities that
make the pseudocode of Partition a little tricky, however. For example,
the indices i and j never index the subarray A[p . . r] out of bounds, but this
isn't entirely apparent from the code. As another example, it is important
that A[p] be used as the pivot element x. If A[r] is used instead and it
happens that A[r] is also the largest element in the subarray A[p . . r], then
Partition returns to Quicksort the value q = r, and Quicksort loops
forever. Problem 8-1 asks you to prove Partition correct.

The running time of Partition on an array A[p . . r] is Θ(n), where
n = r - p + 1 (see Exercise 8.1-3).

Figure 8.1 The operation of Partition on a sample array. Lightly shaded array
elements have been placed into the correct partitions, and heavily shaded elements
are not yet in their partitions. (a) The input array, with the initial values of i and j
just off the left and right ends of the array. We partition around x = A[p] = 5.
(b) The positions of i and j at line 9 of the first iteration of the while loop. (c) The
result of exchanging the elements pointed to by i and j in line 10. (d) The positions
of i and j at line 9 of the second iteration of the while loop. (e) The positions of
i and j at line 9 of the third and last iteration of the while loop. The procedure
terminates because i ≥ j, and the value q = j is returned. Array elements up to
and including A[j] are less than or equal to x = 5, and array elements after A[j]
are greater than or equal to x = 5.



Exercises 



8.1-1 

Using Figure 8.1 as a model, illustrate the operation of Partition on the
array A = ⟨13, 19, 9, 5, 12, 8, 7, 4, 11, 2, 6, 21⟩.

8.1-2

What value of q does Partition return when all elements in the array
A[p . . r] have the same value?

8.1-3

Give a brief argument that the running time of Partition on a subarray
of size n is Θ(n).

8.1-4 

How would you modify Quicksort to sort in nonincreasing order? 



8.2 Performance of quicksort 

The running time of quicksort depends on whether the partitioning is bal- 
anced or unbalanced, and this in turn depends on which elements are used 
for partitioning. If the partitioning is balanced, the algorithm runs asymp- 
totically as fast as merge sort. If the partitioning is unbalanced, however, 
it can run asymptotically as slow as insertion sort. In this section, we shall 
informally investigate how quicksort performs under the assumptions of 
balanced versus unbalanced partitioning. 

Worst-case partitioning 

The worst-case behavior for quicksort occurs when the partitioning routine
produces one region with n - 1 elements and one with only 1 element.
(This claim is proved in Section 8.4.1.) Let us assume that this unbalanced
partitioning arises at every step of the algorithm. Since partitioning costs
Θ(n) time and T(1) = Θ(1), the recurrence for the running time is

\[ T(n) = T(n-1) + \Theta(n) . \]

To evaluate this recurrence, we observe that T(1) = Θ(1) and then
iterate:

\[ \begin{aligned}
T(n) &= T(n-1) + \Theta(n) \\
     &= \sum_{k=1}^{n} \Theta(k) \\
     &= \Theta\!\left( \sum_{k=1}^{n} k \right) \\
     &= \Theta(n^2) .
\end{aligned} \]

We obtain the last line by observing that the sum of k for k = 1 to n is the
arithmetic series (3.2). Figure 8.2 shows a recursion tree for this worst-case
execution of quicksort. (See Section 4.2 for a discussion of recursion trees.)

Thus, if the partitioning is maximally unbalanced at every recursive
step of the algorithm, the running time is Θ(n²). Therefore the
worst-case running time of quicksort is no better than that of insertion
sort. Moreover, the Θ(n²) running time occurs when the input array is
already completely sorted, a common situation in which insertion sort runs
in O(n) time.

Figure 8.2 A recursion tree for Quicksort in which the Partition procedure
always puts only a single element on one side of the partition (the worst case). The
resulting running time is Θ(n²).
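
The quadratic behavior on sorted input can be observed directly. The sketch below is only an illustration: it assumes the two-index Partition of Section 8.1 with pivot A[p] (not reproduced in this excerpt), and it charges each call to Partition a cost equal to the size of the subarray, a hypothetical accounting that stands in for the Θ(n) term.

```python
def partition(a, p, r):
    """Two-index partition of a[p..r] around the pivot x = a[p]."""
    x = a[p]
    i = p - 1
    j = r + 1
    while True:
        j -= 1
        while a[j] > x:
            j -= 1
        i += 1
        while a[i] < x:
            i += 1
        if i < j:
            a[i], a[j] = a[j], a[i]
        else:
            return j


def quicksort_cost(a, p, r):
    """Sort a[p..r] and return the total size of all partitioned subarrays."""
    if p >= r:
        return 0
    q = partition(a, p, r)
    cost = r - p + 1  # charge the subarray size for this partitioning step
    return cost + quicksort_cost(a, p, q) + quicksort_cost(a, q + 1, r)
```

On an already sorted array of n distinct keys every split is 1 to n - 1, so the total is n + (n - 1) + ... + 2 = n(n + 1)/2 - 1, in line with the arithmetic series used above.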

Best-case partitioning 

If the partitioning procedure produces two regions of size n/2, quicksort 
runs much faster. The recurrence is then 

T(n) = 2T(n/2) + Θ(n),

which by case 2 of the master theorem (Theorem 4.1) has solution T(n) =
Θ(n lg n). Thus, this best-case partitioning produces a much faster
algorithm. Figure 8.3 shows the recursion tree for this best-case execution
of quicksort.

Balanced partitioning 

The average-case running time of quicksort is much closer to the best case 
than to the worst case, as the analyses in Section 8.4 will show. The key 
to understanding why this might be true is to understand how the balance 
of the partitioning is reflected in the recurrence that describes the running 
time. 

Suppose, for example, that the partitioning algorithm always produces a
9-to-1 proportional split, which at first blush seems quite unbalanced. We
then obtain the recurrence

T(n) = T(9n/10) + T(n/10) + n

on the running time of quicksort, where we have replaced Θ(n) by n for
convenience. Figure 8.4 shows the recursion tree for this recurrence.
Notice that every level of the tree has cost n, until a boundary condition
is reached at depth log_10 n = Θ(lg n), and then the levels have cost at
most n. The recursion terminates at depth log_{10/9} n = Θ(lg n). The total
cost of quicksort is therefore Θ(n lg n). Thus, with a 9-to-1 proportional
split at every level of recursion, which intuitively seems quite unbalanced,
quicksort runs in Θ(n lg n) time, asymptotically the same as if the split
were right down the middle. In fact, even a 99-to-1 split yields an
O(n lg n) running time. The reason is that any split of constant
proportionality yields a recursion tree of depth Θ(lg n), where the cost at
each level is O(n). The running time is therefore Θ(n lg n) whenever the
split has constant proportionality.

Figure 8.3 A recursion tree for Quicksort in which Partition always balances
the two sides of the partition equally (the best case). The resulting running time
is Θ(n lg n).

Figure 8.4 A recursion tree for Quicksort in which Partition always produces
a 9-to-1 split, yielding a running time of Θ(n lg n).
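
The 9-to-1 recurrence can also be evaluated numerically. The following sketch is an illustration under stated assumptions: sizes are rounded to integers (the large side receives floor(9n/10) elements), the boundary cost is taken to be T(1) = 1, and the names split_cost and depth_bound are hypothetical.

```python
import math
from functools import lru_cache


@lru_cache(maxsize=None)
def split_cost(n):
    """Evaluate T(n) = T(9n/10) + T(n/10) + n with T(1) = 1.

    The split is rounded to integers: the large side gets floor(9n/10)
    elements and the small side gets the remainder.
    """
    if n <= 1:
        return 1
    big = (9 * n) // 10
    return split_cost(big) + split_cost(n - big) + n


def depth_bound(n):
    """Depth log_{10/9} n at which the recursion must have terminated."""
    return math.log(n) / math.log(10 / 9)
```

Every level of the recursion tree costs at most n and there are at most depth_bound(n) + 1 levels, so split_cost(n) never exceeds n * (depth_bound(n) + 1), which is the O(n lg n) behavior argued above.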

Intuition for the average case 

To develop a clear notion of the average case for quicksort, we must make 
an assumption about how frequently we expect to encounter the various
inputs. A common assumption is that all permutations of the input numbers
are equally likely. We shall discuss this assumption in the next section, but 
first let's explore its ramifications. 

When we run quicksort on a random input array, it is unlikely that 
the partitioning always happens in the same way at every level, as our 
informal analysis has assumed. We expect that some of the splits will 
be reasonably well balanced and that some will be fairly unbalanced. For 
example, Exercise 8.2-5 asks you to show that about 80 percent of the time
Partition produces a split that is more balanced than 9 to 1, and about 
20 percent of the time it produces a split that is less balanced than 9 to 1. 

In the average case, Partition produces a mix of "good" and "bad" 
splits. In a recursion tree for an average-case execution of Partition, the 
good and bad splits are distributed randomly throughout the tree. Suppose 
for the sake of intuition, however, that the good and bad splits alternate 
levels in the tree, and that the good splits are best-case splits and the bad 
splits are worst-case splits. Figure 8.5(a) shows the splits at two consecutive 
levels in the recursion tree. At the root of the tree, the cost is n for 
partitioning and the subarrays produced have sizes n - 1 and 1: the worst 
case. At the next level, the subarray of size n - 1 is best-case partitioned into
two subarrays of size (n - 1)/2. Let's assume that the boundary-condition
cost is 1 for the subarray of size 1. 

The combination of the bad split followed by the good split produces
three subarrays of sizes 1, (n - 1)/2, and (n - 1)/2 at a combined cost
of 2n - 1 = Θ(n). Certainly, this situation is no worse than that in
Figure 8.5(b), namely a single level of partitioning that produces two
subarrays of sizes (n - 1)/2 + 1 and (n - 1)/2 at a cost of n = Θ(n). Yet
this latter situation is very nearly balanced, certainly better than 9 to 1.
Intuitively, the Θ(n) cost of the bad split can be absorbed into the Θ(n)
cost of the good split, and the resulting split is good. Thus, the running
time of quicksort, when levels alternate between good and bad splits, is
like the running time for good splits alone: still O(n lg n), but with a
slightly larger constant hidden by the O-notation. We shall give a rigorous
analysis of the average case in Section 8.4.2.

Figure 8.5 (a) Two levels of a recursion tree for quicksort. The partitioning at
the root costs n and produces a "bad" split: two subarrays of sizes 1 and n - 1.
The partitioning of the subarray of size n - 1 costs n - 1 and produces a "good"
split: two subarrays of size (n - 1)/2. (b) A single level of a recursion tree that is
worse than the combined levels in (a), yet very well balanced.



Exercises 



8.2-1 

Show that the running time of Quicksort is Θ(n lg n) when all elements
of array A have the same value. 

8.2-2 

Show that the running time of Quicksort is Θ(n²) when the array A is
sorted in nonincreasing order. 

8.2-3 

Banks often record transactions on an account in order of the times of 
the transactions, but many people like to receive their bank statements 
with checks listed in order by check number. People usually write checks 
in order by check number, and merchants usually cash them with reason- 
able dispatch. The problem of converting time-of-transaction ordering to 
check-number ordering is therefore the problem of sorting almost-sorted 
input. Argue that the procedure Insertion-Sort would tend to beat the 
procedure Quicksort on this problem. 

8.2-4

Suppose that the splits at every level of quicksort are in the proportion
1 - α to α, where 0 < α ≤ 1/2 is a constant. Show that the minimum
depth of a leaf in the recursion tree is approximately -lg n / lg α and the
maximum depth is approximately -lg n / lg(1 - α). (Don't worry about
integer round-off.)

8.2-5 *

Argue that for any constant 0 < α ≤ 1/2, the probability is approximately
1 - 2α that on a random input array, Partition produces a split more
balanced than 1 - α to α. For what value of α are the odds even that the
split is more balanced than less balanced?



8.3 Randomized versions of quicksort 

In exploring the average-case behavior of quicksort, we have made an as- 
sumption that all permutations of the input numbers are equally likely. 
When this assumption on the distribution of the inputs is valid, many 
people regard quicksort as the algorithm of choice for large enough in- 
puts. In an engineering situation, however, we cannot always expect it 
to hold. (See Exercise 8.2-3.) This section introduces the notion of a 
randomized algorithm and presents two randomized versions of quicksort 
that overcome the assumption that all permutations of the input numbers 
are equally likely. 

An alternative to assuming a distribution of inputs is to impose a distri- 
bution. For example, suppose that before sorting the input array, quicksort 
randomly permutes the elements to enforce the property that every permu- 
tation is equally likely. (Exercise 8.3-4 asks for an algorithm that randomly 
permutes the elements of an array of size n in time O(n).) This
modification does not improve the worst-case running time of the algorithm, but it
does make the running time independent of the input ordering. 

We call an algorithm randomized if its behavior is determined not only
by the input but also by values produced by a random-number generator.
We shall assume that we have at our disposal a random-number generator
Random. A call to Random(a, b) returns an integer between a
and b, inclusive, with each such integer being equally likely. For example,
Random(0, 1) produces a 0 with probability 1/2 and a 1 with probability
1/2. Each integer returned by Random is independent of the integers
returned on previous calls. You may imagine Random as rolling a
(b - a + 1)-sided die to obtain its output. (In practice, most programming
environments offer a pseudorandom-number generator: a deterministic
algorithm that returns numbers that "look" statistically random.)

This randomized version of quicksort has an interesting property that 
is also possessed by many other randomized algorithms: no particular in- 
put elicits its worst-case behavior. Instead, its worst case depends on the 
random-number generator. Even intentionally, you cannot produce a bad 
input array for quicksort, since the random permutation makes the input 
order irrelevant. The randomized algorithm performs badly only if the 
random-number generator produces an unlucky permutation to be sorted. 
Exercise 13.4-4 shows that almost all permutations cause quicksort to
perform nearly as well as the average case: there are very few permutations
that cause near-worst-case behavior.

A randomized strategy is typically useful when there are many ways in 
which an algorithm can proceed but it is difficult to determine a way that 
is guaranteed to be good. If many of the alternatives are good, simply 
choosing one randomly can yield a good strategy. Often, an algorithm 
must make many choices during its execution. If the benefits of good 
choices outweigh the costs of bad choices, a random selection of good and 
bad choices can yield an efficient algorithm. We noted in Section 8.2 that 
a mixture of good and bad splits yields a good running time for quicksort, 
and thus it makes sense that randomized versions of the algorithm should 
perform well. 

By modifying the Partition procedure, we can design another random- 
ized version of quicksort that uses this random-choice strategy. At each 
step of the quicksort algorithm, before the array is partitioned, we ex- 
change element A[p] with an element chosen at random from A[p..r]. 
This modification ensures that the pivot element x = A[p] is equally likely 
to be any of the r - p + 1 elements in the subarray. Thus, we expect the 
split of the input array to be reasonably well balanced on average. The 
randomized algorithm based on randomly permuting the input array also 
works well on average, but it is somewhat more difficult to analyze than 
this version. 

The changes to Partition and Quicksort are small. In the new parti- 
tion procedure, we simply implement the swap before actually partitioning: 

Randomized-Partition(A, p, r)
1 i ← Random(p, r)
2 exchange A[p] ↔ A[i]
3 return Partition(A, p, r)

We now make the new quicksort call Randomized-Partition in place of 
Partition: 

Randomized-Quicksort(A, p, r)
1 if p < r
2    then q ← Randomized-Partition(A, p, r)
3         Randomized-Quicksort(A, p, q)
4         Randomized-Quicksort(A, q + 1, r)

We analyze this algorithm in the next section. 
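
For readers who want to experiment, the two procedures translate almost line for line into Python. This is a sketch under assumptions: Partition is taken to be the two-index scheme of Section 8.1 (not reproduced in this excerpt), indices are 0-based and inclusive, and Python's random.randint(p, r), which returns an integer from p to r with both ends included, plays the role of Random(p, r).

```python
import random


def partition(a, p, r):
    """Two-index partition of a[p..r] around the pivot x = a[p]."""
    x = a[p]
    i = p - 1
    j = r + 1
    while True:
        j -= 1
        while a[j] > x:
            j -= 1
        i += 1
        while a[i] < x:
            i += 1
        if i < j:
            a[i], a[j] = a[j], a[i]
        else:
            return j


def randomized_partition(a, p, r):
    """Exchange a[p] with a random element of a[p..r], then partition."""
    i = random.randint(p, r)  # stands in for Random(p, r)
    a[p], a[i] = a[i], a[p]
    return partition(a, p, r)


def randomized_quicksort(a, p, r):
    """Sort a[p..r] in place using a randomly chosen pivot at each step."""
    if p < r:
        q = randomized_partition(a, p, r)
        randomized_quicksort(a, p, q)
        randomized_quicksort(a, q + 1, r)
```

The result is always sorted, whatever values the generator produces; only the running time depends on the random choices.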

Exercises 

8.3-1 

Why do we analyze the average-case performance of a randomized
algorithm and not its worst-case performance?

8.3-2

During the running of the procedure Randomized-Quicksort, how many 
calls are made to the random-number generator Random in the worst case? 
How does the answer change in the best case? 

8.3-3 * 

Describe an implementation of the procedure Random(a, b) that uses only
fair coin flips. What is the expected running time of your procedure? 

8.3-4 * 

Give a Θ(n)-time, randomized procedure that takes as input an array
A[1 . . n] and performs a random permutation on the array elements.



8.4 Analysis of quicksort 

Section 8.2 gave some intuition for the worst-case behavior of quicksort 
and for why we expect it to run quickly. In this section, we analyze the 
behavior of quicksort more rigorously. We begin with a worst-case
analysis, which applies to either Quicksort or Randomized-Quicksort, and
conclude with an average-case analysis of Randomized-Quicksort. 

8.4.1 Worst-case analysis 

We saw in Section 8.2 that a worst-case split at every level of recursion in 
quicksort produces a Θ(n²) running time, which, intuitively, is the
worst-case running time of the algorithm. We now prove this assertion.

Using the substitution method (see Section 4.1), we can show that the
running time of quicksort is O(n²). Let T(n) be the worst-case time for
the procedure Quicksort on an input of size n. We have the recurrence 

$$
T(n) = \max_{1 \le q \le n-1} \bigl( T(q) + T(n-q) \bigr) + \Theta(n), \tag{8.1}
$$

where the parameter q ranges from 1 to n - 1 because the procedure
Partition produces two regions, each having size at least 1. We guess that
T(n) ≤ cn² for some constant c. Substituting this guess into (8.1), we
obtain

$$
\begin{aligned}
T(n) &\le \max_{1 \le q \le n-1} \bigl( cq^2 + c(n-q)^2 \bigr) + \Theta(n) \\
     &= c \cdot \max_{1 \le q \le n-1} \bigl( q^2 + (n-q)^2 \bigr) + \Theta(n).
\end{aligned}
$$

The expression q² + (n - q)² achieves a maximum over the range 1 ≤ q ≤
n - 1 at one of the endpoints, as can be seen since the second derivative of
the expression with respect to q is positive (see Exercise 8.4-2). This gives
us the bound

$$
\max_{1 \le q \le n-1} \bigl( q^2 + (n-q)^2 \bigr) \le 1^2 + (n-1)^2 = n^2 - 2(n-1).
$$

Continuing with our bounding of T(n), we obtain

$$
\begin{aligned}
T(n) &\le cn^2 - 2c(n-1) + \Theta(n) \\
     &\le cn^2,
\end{aligned}
$$

since we can pick the constant c large enough so that the 2c(n - 1) term
dominates the Θ(n) term. Thus, the (worst-case) running time of quicksort
is Θ(n²).
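
The endpoint claim used in this argument is easy to check numerically for small n. The snippet below is a sanity check rather than a proof, and the helper name worst_split_value is hypothetical.

```python
def worst_split_value(n):
    """Maximum of q^2 + (n - q)^2 over q = 1, 2, ..., n - 1."""
    return max(q * q + (n - q) ** 2 for q in range(1, n))


def maximizers(n):
    """All q in 1..n-1 at which q^2 + (n - q)^2 attains its maximum."""
    best = worst_split_value(n)
    return [q for q in range(1, n) if q * q + (n - q) ** 2 == best]
```

For every n tried, worst_split_value(n) equals n² - 2(n - 1), and the maximum is attained only at the endpoints q = 1 and q = n - 1.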

8.4.2 Average-case analysis 

We have already given an intuitive argument why the average-case running
time of Randomized-Quicksort is Θ(n lg n): if the split induced
by Randomized-Partition puts any constant fraction of the elements on
one side of the partition, then the recursion tree has depth Θ(lg n) and
Θ(n) work is performed at Θ(lg n) of these levels. We can analyze the
expected running time of Randomized-Quicksort precisely by first un- 
derstanding how the partitioning procedure operates. We can then develop 
a recurrence for the average time required to sort an n-element array and 
solve this recurrence to determine bounds on the expected running time. 
As part of the process of solving the recurrence, we shall develop tight 
bounds on an interesting summation. 

Analysis of partitioning 

We first make some observations about the operation of Partition. When 
Partition is called in line 3 of the procedure Randomized-Partition, 
the element A[p] has already been exchanged with a random element in 
A[p . . r]. To simplify the analysis, we assume that all input numbers are 
distinct. If the input numbers are not all distinct, it is still true that
quicksort's average-case running time is O(n lg n), but a somewhat more
intricate analysis than we present here is required.

Our first observation is that the value of q returned by Partition
depends only on the rank of x = A[p] among the elements in A[p . . r]. (The
rank of a number in a set is the number of elements less than or equal to
it.) If we let n = r - p + 1 be the number of elements in A[p . . r], swapping
A[p] with a random element from A[p . . r] yields a probability 1/n that
rank(x) = i for i = 1, 2, ..., n.

We next compute the likelihoods of the various outcomes of the
partitioning. If rank(x) = 1, then the first time through the while loop in
lines 4-11 of Partition, index i stops at i = p and index j stops at j = p.
Thus, when q = j is returned, the "low" side of the partition contains the
sole element A[p]. This event occurs with probability 1/n since that is the
probability that rank(x) = 1.

If rank(x) ≥ 2, then there is at least one element smaller than x = A[p].
Consequently, the first time through the while loop, index i stops at i = p
but j stops before reaching p. An exchange with A[p] is then made to
put A[p] in the high side of the partition. When Partition terminates,
each of the rank(x) - 1 elements in the low side of the partition is strictly
less than x. Thus, for each i = 1, 2, ..., n - 1, when rank(x) ≥ 2, the
probability is 1/n that the low side of the partition has i elements.

Combining these two cases, we conclude that the size q - p + 1 of the
low side of the partition is 1 with probability 2/n and that the size is i
with probability 1/n for i = 2, 3, ..., n - 1.
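
This distribution can be confirmed by brute force: try each of the n possible exchanges once and record the size of the low side. The sketch assumes the two-index Partition of Section 8.1 with pivot A[p] (not reproduced in this excerpt), 0-based indices, and distinct keys; low_side_sizes is a hypothetical helper.

```python
from collections import Counter


def partition(a, p, r):
    """Two-index partition of a[p..r] around the pivot x = a[p]."""
    x = a[p]
    i = p - 1
    j = r + 1
    while True:
        j -= 1
        while a[j] > x:
            j -= 1
        i += 1
        while a[i] < x:
            i += 1
        if i < j:
            a[i], a[j] = a[j], a[i]
        else:
            return j


def low_side_sizes(base):
    """Count low-side sizes over every possible choice of pivot position.

    base must hold at least two distinct keys. Each position i is used
    once, mimicking one outcome of the random exchange with a[0].
    """
    n = len(base)
    counts = Counter()
    for i in range(n):
        a = list(base)
        a[0], a[i] = a[i], a[0]
        q = partition(a, 0, n - 1)
        counts[q + 1] += 1  # the low side is a[0..q]
    return counts
```

For an array of 10 distinct keys the counter shows size 1 twice and each of the sizes 2 through 9 once, which corresponds to the probabilities 2/n and 1/n derived above.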

A recurrence for the average case 

We now establish a recurrence for the expected running time of
Randomized-Quicksort. Let T(n) denote the average time required to sort
an n-element input array. A call to Randomized-Quicksort with a
1-element array takes constant time, so we have T(1) = Θ(1). A call to
Randomized-Quicksort with an array A[1 . . n] of length n uses time
Θ(n) to partition the array. The Partition procedure returns an index q,
and then Randomized-Quicksort is called recursively with subarrays of
length q and n - q. Consequently, the average time to sort an array of
length n can be expressed as

$$
T(n) = \frac{1}{n} \left( T(1) + T(n-1) + \sum_{q=1}^{n-1} \bigl( T(q) + T(n-q) \bigr) \right) + \Theta(n). \tag{8.2}
$$

The value of q has an almost uniform distribution, except that the value
q = 1 is twice as likely as the others, as was noted above. Using the facts
that T(1) = Θ(1) and T(n - 1) = O(n²) from our worst-case analysis, we
have

$$
\begin{aligned}
\frac{1}{n} \bigl( T(1) + T(n-1) \bigr) &= \frac{1}{n} \bigl( \Theta(1) + O(n^2) \bigr) \\
                                        &= O(n),
\end{aligned}
$$

and the term Θ(n) in equation (8.2) can therefore absorb the expression
(1/n)(T(1) + T(n - 1)). We can thus restate recurrence (8.2) as

$$
T(n) = \frac{1}{n} \sum_{q=1}^{n-1} \bigl( T(q) + T(n-q) \bigr) + \Theta(n). \tag{8.3}
$$

Observe that for k = 1, 2, ..., n - 1, each term T(k) of the sum occurs
once as T(q) and once as T(n - q). Collapsing the two terms of the sum
yields

$$
T(n) = \frac{2}{n} \sum_{k=1}^{n-1} T(k) + \Theta(n). \tag{8.4}
$$

Solving the recurrence

We can solve the recurrence (8.4) using the substitution method. Assume
inductively that T(n) ≤ an lg n + b for some constants a > 0 and b > 0 to
be determined. We can pick a and b sufficiently large so that an lg n + b
is greater than T(1). Then for n > 1, we have by substitution

$$
\begin{aligned}
T(n) &= \frac{2}{n} \sum_{k=1}^{n-1} T(k) + \Theta(n) \\
     &\le \frac{2}{n} \sum_{k=1}^{n-1} (ak \lg k + b) + \Theta(n) \\
     &= \frac{2a}{n} \sum_{k=1}^{n-1} k \lg k + \frac{2b}{n}(n-1) + \Theta(n).
\end{aligned}
$$

We show below that the summation in the last line can be bounded by

$$
\sum_{k=1}^{n-1} k \lg k \le \frac{1}{2} n^2 \lg n - \frac{1}{8} n^2. \tag{8.5}
$$

Using this bound, we obtain

$$
\begin{aligned}
T(n) &\le \frac{2a}{n} \left( \frac{1}{2} n^2 \lg n - \frac{1}{8} n^2 \right) + \frac{2b}{n}(n-1) + \Theta(n) \\
     &\le an \lg n - \frac{a}{4} n + 2b + \Theta(n) \\
     &= an \lg n + b + \left( \Theta(n) + b - \frac{a}{4} n \right) \\
     &\le an \lg n + b,
\end{aligned}
$$

since we can choose a large enough so that (a/4)n dominates Θ(n) + b. We
conclude that quicksort's average running time is O(n lg n).
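
The conclusion can be checked numerically on one concrete instance of the recurrence. The sketch below is an illustration under assumptions that are not part of the analysis: the Θ(n) term is replaced by exactly n, the boundary value is T(1) = 1, and the constants a = 2 and b = 1 used in the comparison are picked by hand for this instance.

```python
import math


def average_case_costs(n_max):
    """Return a list t with t[n] = (2/n) * (t[1] + ... + t[n-1]) + n.

    t[0] is unused padding and t[1] = 1 is the assumed boundary value.
    """
    t = [0.0, 1.0]
    prefix = 1.0  # running sum t[1] + ... + t[n-1]
    for n in range(2, n_max + 1):
        t.append(2.0 * prefix / n + n)
        prefix += t[n]
    return t


def guess(n, a=2.0, b=1.0):
    """The inductive guess a * n * lg n + b with hand-picked constants."""
    return a * n * math.log2(n) + b
```

For this instance the computed values stay below guess(n) for every n tried, which is consistent with the O(n lg n) bound. Other choices of the Θ(n) term would need other constants.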

Tight bounds on the key summation 



It remains to prove the bound (8.5) on the summation

$$
\sum_{k=1}^{n-1} k \lg k.
$$

Since each term is at most n lg n, we have the bound

$$
\sum_{k=1}^{n-1} k \lg k \le n^2 \lg n,
$$

which is tight to within a constant factor. This bound is not strong enough
to solve the recurrence as T(n) = O(n lg n), however. Specifically, we need
a bound of (1/2)n² lg n - Ω(n²) for the solution of the recurrence to work out.

We can get this bound on the summation by splitting it into two parts,
as discussed in Section 3.2 on page 48. We obtain

$$
\sum_{k=1}^{n-1} k \lg k = \sum_{k=1}^{\lceil n/2 \rceil - 1} k \lg k + \sum_{k=\lceil n/2 \rceil}^{n-1} k \lg k.
$$

The lg k in the first summation on the right is bounded above by lg(n/2) =
lg n - 1. The lg k in the second summation is bounded above by lg n. Thus,

$$
\begin{aligned}
\sum_{k=1}^{n-1} k \lg k &\le (\lg n - 1) \sum_{k=1}^{\lceil n/2 \rceil - 1} k + \lg n \sum_{k=\lceil n/2 \rceil}^{n-1} k \\
  &= \lg n \sum_{k=1}^{n-1} k - \sum_{k=1}^{\lceil n/2 \rceil - 1} k \\
  &\le \frac{1}{2} n(n-1) \lg n - \frac{1}{2} \left( \frac{n}{2} - 1 \right) \frac{n}{2} \\
  &\le \frac{1}{2} n^2 \lg n - \frac{1}{8} n^2
\end{aligned}
$$

if n ≥ 2. This is the bound (8.5).
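
As a quick sanity check, bound (8.5) can be compared with the exact value of the sum for small n. The function names below are hypothetical, and floating-point arithmetic is used, so this is an illustration rather than a proof.

```python
import math


def k_lg_k_sum(n):
    """Exact value (up to rounding) of the sum of k * lg k for k = 1..n-1."""
    return sum(k * math.log2(k) for k in range(1, n))


def bound_8_5(n):
    """Right-hand side of bound (8.5): (1/2) n^2 lg n - (1/8) n^2."""
    return 0.5 * n * n * math.log2(n) - n * n / 8.0
```

For n = 2 the sum is 0 and the bound is 1.5. The gap widens as n grows, which leaves room for the tighter estimate requested in Exercise 8.4-5.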

Exercises 

8.4-1 

Show that quicksort's best-case running time is Ω(n lg n).

8.4-2

Show that q² + (n - q)² achieves a maximum over q = 1, 2, ..., n - 1 when
q = 1 or q = n - 1.

8.4-3

Show that Randomized-Quicksort's expected running time is Ω(n lg n).

8.4-4

The running time of quicksort can be improved in practice by taking ad- 
vantage of the fast running time of insertion sort when its input is "nearly" 
sorted. When quicksort is called on a subarray with fewer than k elements, 
let it simply return without sorting the subarray. After the top-level call to 
quicksort returns, run insertion sort on the entire array to finish the
sorting process. Argue that this sorting algorithm runs in O(nk + n lg(n/k))
expected time. How should k be picked, both in theory and in practice? 

8.4-5 * 

Prove the identity

$$
\int x \ln x \, dx = \frac{1}{2} x^2 \ln x - \frac{1}{4} x^2,
$$

and then use the integral approximation method to give a tighter upper
bound than (8.5) on the summation Σ_{k=1}^{n-1} k lg k.

8.4-6 *

Consider modifying the Partition procedure by randomly picking three
elements from array A and partitioning about their median. Approximate
the probability of getting at worst an α-to-(1 - α) split, as a function of α
in the range 0 < α < 1.



Problems 

8-1 Partition correctness 

Give a careful argument that the procedure Partition in Section 8.1 is 
correct. Prove the following: 

a. The indices i and j never reference an element of A outside the interval 
[p..r]. 

b. The index j is not equal to r when Partition terminates (so that the
split is always nontrivial).

c. Every element of A[p . . j] is less than or equal to every element of
A[j + 1 . . r] when Partition terminates.

8-2 Lomuto's partitioning algorithm 

Consider the following variation of Partition, due to N. Lomuto. To 
partition A[p . . r], this version grows two regions, A[p . . i] and A[i + 1 . . j],
such that every element in the first region is less than or equal to x = A[r]
and every element in the second region is greater than x.

Lomuto-Partition(A, p, r)
1 x ← A[r]
2 i ← p - 1
3 for j ← p to r
4    do if A[j] ≤ x
5          then i ← i + 1
6               exchange A[i] ↔ A[j]
7 if i < r
8    then return i
9    else return i - 1

a. Argue that Lomuto-Partition is correct. 

b. What are the maximum numbers of times that an element can be moved
by Partition and by Lomuto-Partition?

c. Argue that Lomuto-Partition, like Partition, runs in Θ(n) time on
an n-element subarray.

d. How does replacing Partition by Lomuto-Partition affect the
running time of Quicksort when all input values are equal?

e. Define a procedure Randomized-Lomuto-Partition that exchanges 
A[r] with a randomly chosen element in A[p . . r] and then calls Lomuto- 
Partition. Show that the probability that a given value q is returned 
by Randomized-Lomuto-Partition is equal to the probability that 
p + r - q is returned by Randomized-Partition. 
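
To experiment with this problem, Lomuto-Partition can be transcribed into Python as follows. The sketch is a direct translation of the pseudocode with 0-based, inclusive indices (a choice made for this illustration); it does not answer any of the parts above.

```python
def lomuto_partition(a, p, r):
    """Partition a[p..r] (inclusive) around the pivot x = a[r].

    Returns q such that every element of a[p..q] is <= every element
    of a[q+1..r].
    """
    x = a[r]
    i = p - 1
    for j in range(p, r + 1):
        if a[j] <= x:
            # Grow the region of elements <= x by one position.
            i += 1
            a[i], a[j] = a[j], a[i]
    if i < r:
        return i
    # Every element was <= x; back off so the high side is not empty.
    return i - 1
```

The final test on i mirrors lines 7-9 of the pseudocode: when every element is at most x the index i ends at r, and returning i - 1 keeps both sides of the split nonempty.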

8-3 Stooge sort 

Professors Howard, Fine, and Howard have proposed the following "ele- 
gant" sorting algorithm: 

Stooge-Sort(A, i, j)
1 if A[i] > A[j]
2    then exchange A[i] ↔ A[j]
3 if i + 1 ≥ j
4    then return
5 k ← ⌊(j - i + 1)/3⌋            ▷ Round down.
6 Stooge-Sort(A, i, j - k)       ▷ First two-thirds.
7 Stooge-Sort(A, i + k, j)       ▷ Last two-thirds.
8 Stooge-Sort(A, i, j - k)       ▷ First two-thirds again.

a. Argue that Stooge-Sort(A, 1, length[A]) correctly sorts the input array
A[1 . . n], where n = length[A].

b. Give a recurrence for the worst-case running time of Stooge-Sort and
a tight asymptotic (Θ-notation) bound on the worst-case running time.

c. Compare the worst-case running time of Stooge-Sort with that of in- 
sertion sort, merge sort, heapsort, and quicksort. Do the professors 
deserve tenure? 
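
For experimentation, Stooge-Sort can be transcribed into Python as below. This is a direct translation of the pseudocode with 0-based, inclusive indices, offered only as a sketch for trying the procedure on small inputs; it assumes a nonempty subarray.

```python
def stooge_sort(a, i, j):
    """Sort a[i..j] (inclusive, nonempty) in place, following Stooge-Sort."""
    if a[i] > a[j]:
        a[i], a[j] = a[j], a[i]
    if i + 1 >= j:
        return
    k = (j - i + 1) // 3       # round down
    stooge_sort(a, i, j - k)   # first two-thirds
    stooge_sort(a, i + k, j)   # last two-thirds
    stooge_sort(a, i, j - k)   # first two-thirds again
```

Calling stooge_sort(a, 0, len(a) - 1) sorts a nonempty list a. Keep the inputs small, since the three recursive calls on two-thirds of the range add up quickly.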

8-4 Stack depth for quicksort 

The Quicksort algorithm of Section 8.1 contains two recursive calls to it- 
self. After the call to Partition, the left subarray is recursively sorted and 
then the right subarray is recursively sorted. The second recursive call in 
Quicksort is not really necessary; it can be avoided by using an iterative 
control structure. This technique, called tail recursion, is provided auto- 
matically by good compilers. Consider the following version of quicksort, 
which simulates tail recursion.

Quicksort'(A, p, r)
1 while p < r
2    do ▷ Partition and sort left subarray
3       q ← Partition(A, p, r)
4       Quicksort'(A, p, q)
5       p ← q + 1

a. Argue that Quicksort'(A, 1, length[A]) correctly sorts the array A.

Compilers usually execute recursive procedures by using a stack that con- 
tains pertinent information, including the parameter values, for each re- 
cursive call. The information for the most recent call is at the top of the 
stack, and the information for the initial call is at the bottom. When a 
procedure is invoked, its information is pushed onto the stack; when it ter- 
minates, its information is popped. Since we assume that array parameters 
are actually represented by pointers, the information for each procedure 
call on the stack requires O(1) stack space. The stack depth is the maximum
amount of stack space used at any time during a computation.

b. Describe a scenario in which the stack depth of Quicksort' is Θ(n) on
an n-element input array.

c. Modify the code for Quicksort' so that the worst-case stack depth is
Θ(lg n).
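
The procedure Quicksort' as given can be tried out with the following Python sketch. It transcribes the five-line pseudocode only and does not include the modification asked for in part (c). Partition is assumed to be the two-index scheme of Section 8.1 (not reproduced in this excerpt), and indices are 0-based and inclusive.

```python
def partition(a, p, r):
    """Two-index partition of a[p..r] around the pivot x = a[p]."""
    x = a[p]
    i = p - 1
    j = r + 1
    while True:
        j -= 1
        while a[j] > x:
            j -= 1
        i += 1
        while a[i] < x:
            i += 1
        if i < j:
            a[i], a[j] = a[j], a[i]
        else:
            return j


def quicksort_tail(a, p, r):
    """Quicksort with the second recursive call replaced by a loop."""
    while p < r:
        # Partition and sort the left subarray recursively.
        q = partition(a, p, r)
        quicksort_tail(a, p, q)
        # Continue with the right subarray in the same stack frame.
        p = q + 1
```

Only the left subarray causes a new call; the right subarray is handled by updating p and going around the loop again.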

8-5 Median-of-3 partition 

One way to improve the Randomized-Quicksort procedure is to par- 
tition around an element x that is chosen more carefully than by pick- 
ing a random element from the subarray. One common approach is the 
median-of-3 method: choose x as the median (middle element) of a set of 
3 elements randomly selected from the subarray. For this problem, let us 
assume that the elements in the input array A[1 . . n] are distinct and that
n ≥ 3. We denote the sorted output array by A'[1 . . n]. Using the
median-of-3 method to choose the pivot element x, define p_i = Pr{x = A'[i]}.

a. Give an exact formula for p_i as a function of n and i for i = 2, 3, ...,
n - 1. (Note that p_1 = p_n = 0.)

b. By what amount have we increased the likelihood of choosing x =
A'[⌊(n + 1)/2⌋], the median of A[1 . . n], compared to the ordinary
implementation? Assume that n → ∞, and give the limiting ratio of these
probabilities.

c. If we define a "good" split to mean choosing x = A'[i], where n/3 ≤ i ≤
2n/3, by what amount have we increased the likelihood of getting a good
split compared to the ordinary implementation? (Hint: Approximate the
sum by an integral.)

d. Argue that the median-of-3 method affects only the constant factor in
the Ω(n lg n) running time of quicksort.



Chapter notes 



The quicksort procedure was invented by Hoare [98]. Sedgewick [174] 
provides a good reference on the details of implementation and how they 
matter. The advantages of randomized algorithms were articulated by 
Rabin [165].



Sorting in Linear Time



We have now introduced several algorithms that can sort n numbers in
O(n lg n) time. Merge sort and heapsort achieve this upper bound in the
worst case; quicksort achieves it on average. Moreover, for each of these
algorithms, we can produce a sequence of n input numbers that causes the
algorithm to run in Ω(n lg n) time.

These algorithms share an interesting property: the sorted order they 
determine is based only on comparisons between the input elements. We 
call such sorting algorithms comparison sorts. All the sorting algorithms 
introduced thus far are comparison sorts. 

In Section 9.1, we shall prove that any comparison sort must make 
Ω(n lg n) comparisons in the worst case to sort a sequence of n elements.
Thus, merge sort and heapsort are asymptotically optimal, and no com- 
parison sort exists that is faster by more than a constant factor. 

Sections 9.2, 9.3, and 9.4 examine three sorting algorithms (counting
sort, radix sort, and bucket sort) that run in linear time. Needless to say,
these algorithms use operations other than comparisons to determine the
sorted order. Consequently, the Ω(n lg n) lower bound does not apply to
them.



9.1 Lower bounds for sorting 

In a comparison sort, we use only comparisons between elements to gain 
order information about an input sequence (a_1, a_2, ..., a_n). That is, given
two elements a_i and a_j, we perform one of the tests a_i < a_j, a_i ≤ a_j,
a_i = a_j, a_i ≥ a_j, or a_i > a_j to determine their relative order. We may not
inspect the values of the elements or gain order information about them
in any other way.

In this section, we assume without loss of generality that all of the input
elements are distinct. Given this assumption, comparisons of the form
a_i = a_j are useless, so we can assume that no comparisons of this form are
made. We also note that the comparisons a_i ≤ a_j, a_i ≥ a_j, a_i > a_j, and
a_i < a_j are all equivalent in that they yield identical information about
the relative order of a_i and a_j. We therefore assume that all comparisons
have the form a_i ≤ a_j.

Figure 9.1 The decision tree for insertion sort operating on three elements. There
are 3! = 6 possible permutations of the input elements, so the decision tree must
have at least 6 leaves.

The decision-tree model 

Comparison sorts can be viewed abstractly in terms of decision trees. A 
decision tree represents the comparisons performed by a sorting algorithm 
when it operates on an input of a given size. Control, data movement, 
and all other aspects of the algorithm are ignored. Figure 9.1 shows the
decision tree corresponding to the insertion sort algorithm from Section 1.1
operating on an input sequence of three elements.

In a decision tree, each internal node is annotated by a_i : a_j for some i
and j in the range 1 ≤ i, j ≤ n, where n is the number of elements in the
input sequence. Each leaf is annotated by a permutation (π(1), π(2), ...,
π(n)). (See Section 6.1 for background on permutations.) The execution
of the sorting algorithm corresponds to tracing a path from the root of
the decision tree to a leaf. At each internal node, a comparison a_i ≤ a_j is
made. The left subtree then dictates subsequent comparisons for a_i ≤ a_j,
and the right subtree dictates subsequent comparisons for a_i > a_j. When
we come to a leaf, the sorting algorithm has established the ordering
a_π(1) ≤ a_π(2) ≤ ... ≤ a_π(n). Each of the n! permutations on n elements must
appear as one of the leaves of the decision tree for the sorting algorithm
to sort properly.
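
The decision-tree view can be explored by recording the outcomes of the comparisons that a sorting algorithm makes on every possible input. The sketch below assumes the usual insertion sort (the procedure of Section 1.1 is not reproduced in this excerpt); the function names are hypothetical, and each root-to-leaf path is represented by the tuple of comparison outcomes.

```python
from itertools import permutations


def insertion_sort_path(values):
    """Sort a copy of values; return (comparison outcomes, sorted list)."""
    a = list(values)
    outcomes = []
    for j in range(1, len(a)):
        key = a[j]
        i = j - 1
        while i >= 0:
            greater = a[i] > key   # one comparison between two elements
            outcomes.append(greater)
            if not greater:
                break
            a[i + 1] = a[i]        # shift the larger element right
            i -= 1
        a[i + 1] = key
    return tuple(outcomes), a


def decision_tree_leaves(n):
    """Map each root-to-leaf path to the input permutations that follow it."""
    leaves = {}
    for perm in permutations(range(1, n + 1)):
        path, _ = insertion_sort_path(perm)
        leaves.setdefault(path, []).append(perm)
    return leaves
```

For n = 3 there are 6 distinct paths, one per permutation, and the longest has 3 comparisons, which matches the tree described in Figure 9.1.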

A lower bound for the worst case 

The length of the longest path from the root of a decision tree to any 
of its leaves represents the worst-case number of comparisons the sorting 
algorithm performs. Consequently, the worst-case number of comparisons 
for a comparison sort corresponds to the height of its decision tree. A 
lower bound on the heights of decision trees is therefore a lower bound on 
the running time of any comparison sort algorithm. The following theorem 
establishes such a lower bound. 

Theorem 9.1 

Any decision tree that sorts n elements has height Ω(n lg n). 

Proof Consider a decision tree of height h that sorts n elements. Since 
there are n! permutations of n elements, each permutation representing a 
distinct sorted order, the tree must have at least n! leaves. Since a binary 
tree of height h has no more than 2^h leaves, we have 

$$n! \le 2^h ,$$

which, by taking logarithms, implies 

$$h \ge \lg(n!) ,$$

since the lg function is monotonically increasing. From Stirling's 
approximation (2.11), we have 

$$n! > \left(\frac{n}{e}\right)^n ,$$

where e = 2.71828... is the base of natural logarithms; thus 

$$h \ge \lg\left(\frac{n}{e}\right)^n = n \lg n - n \lg e = \Omega(n \lg n) .$$

■ 
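
To get a feel for the bound, we can evaluate it numerically. The Python sketch below is only an illustration; the function name min_worst_case_comparisons is our own. It computes ⌈lg(n!)⌉, the smallest height a decision tree with n! leaves can have, and compares lg(n!) with the quantity n lg n - n lg e used in the proof.

```python
import math


def min_worst_case_comparisons(n):
    """Smallest possible height of a binary tree with n! leaves,
    that is, the ceiling of lg(n!)."""
    return math.ceil(math.log2(math.factorial(n)))


def stirling_lower_estimate(n):
    """The quantity n lg n - n lg e from the proof."""
    return n * math.log2(n) - n * math.log2(math.e)


for n in (3, 5, 10, 100):
    print(n, min_worst_case_comparisons(n), round(stirling_lower_estimate(n), 1))
```

For n = 3 the bound is 3 comparisons, which matches the height of the tree in Figure 9.1.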



Corollary 9.2 

Heapsort and merge sort are asymptotically optimal comparison sorts. 

Proof The O(n lg n) upper bounds on the running times for heapsort 
and merge sort match the Ω(n lg n) worst-case lower bound from 
Theorem 9.1. ■ 

Exercises 

9.1-1 

What is the smallest possible depth of a leaf in a decision tree for a sorting 
algorithm? 

9.1-2 

Obtain asymptotically tight bounds on lg(n!) without using Stirling's 
approximation. Instead, evaluate the summation Σ_{k=1}^{n} lg k using techniques 
from Section 3.2. 






9.1-3 

Show that there is no comparison sort whose running time is linear for at 
least half of the n! inputs of length n. What about a fraction of 1/n of the 
inputs of length n? What about a fraction 1/2^n? 

9.1-4 

Professor Solomon claims that the Ω(n lg n) lower bound for sorting n 
numbers does not apply to his computer environment, in which the control 
flow of a program can split three ways after a single comparison a_i : a_j, 
according to whether a_i < a_j, a_i = a_j, or a_i > a_j. Show that the professor 
is wrong by proving that the number of three-way comparisons required 
to sort n elements is still Ω(n lg n). 

9.1-5 

Prove that 2n - 1 comparisons are necessary in the worst case to merge 
two sorted lists containing n elements each. 

9.1-6 

You are given a sequence of n elements to sort. The input sequence con- 
sists of n/k subsequences, each containing k elements. The elements in 
a given subsequence are all smaller than the elements in the succeeding 
subsequence and larger than the elements in the preceding subsequence. 
Thus, all that is needed to sort the whole sequence of length n is to sort 
the k elements in each of the n/k subsequences. Show an Ω(n lg k) lower 
bound on the number of comparisons needed to solve this variant of the 
sorting problem. (Hint: It is not rigorous to simply combine the lower 
bounds for the individual subsequences.) 



9.2 Counting sort 

Counting sort assumes that each of the n input elements is an integer in 
the range 1 to k, for some integer k. When k = O(n), the sort runs in 
O(n) time. 

The basic idea of counting sort is to determine, for each input element x, 
the number of elements less than x. This information can be used to place 
element x directly into its position in the output array. For example, if 
there are 17 elements less than x, then x belongs in output position 18. 
This scheme must be modified slightly to handle the situation in which 
several elements have the same value, since we don't want to put them all 
in the same position. 

In the code for counting sort, we assume that the input is an array 
A[1..n], and thus length[A] = n. We require two other arrays: the array 
B[1..n] holds the sorted output, and the array C[1..k] provides temporary 
working storage. 

Figure 9.2 The operation of Counting-Sort on an input array A[1..8], where 
each element of A is a positive integer no larger than k = 6. (a) The array A and 
the auxiliary array C after line 4. (b) The array C after line 7. (c)-(e) The output 
array B and the auxiliary array C after one, two, and three iterations of the loop 
in lines 9-11, respectively. Only the lightly shaded elements of array B have been 
filled in. (f) The final sorted output array B. 



Counting-Sort(A, B, k) 
1  for i ← 1 to k 
2       do C[i] ← 0 
3  for j ← 1 to length[A] 
4       do C[A[j]] ← C[A[j]] + 1 
5  ▷ C[i] now contains the number of elements equal to i. 
6  for i ← 2 to k 
7       do C[i] ← C[i] + C[i - 1] 
8  ▷ C[i] now contains the number of elements less than or equal to i. 
9  for j ← length[A] downto 1 
10      do B[C[A[j]]] ← A[j] 
11         C[A[j]] ← C[A[j]] - 1 

Counting sort is illustrated in Figure 9.2. After the initialization in 
lines 1-2, we inspect each input element in lines 3-4. If the value of an 
input element is i, we increment C[i]. Thus, after lines 3-4, C[i] holds 
the number of input elements equal to i for each integer i = 1, 2, ..., k. In 
lines 6-7, we determine for each i = 1, 2, ..., k, how many input elements 
are less than or equal to i; this is done by keeping a running sum of the 
array C. 

Finally, in lines 9-11, we place each element A[j] in its correct sorted 
position in the output array B. If all n elements are distinct, then when 
we first enter line 9, for each A[j], the value C[A[j]] is the correct final 
position of A[j] in the output array, since there are C[A[j]] elements less 
than or equal to A[j]. Because the elements might not be distinct, we 
decrement C[A[j]] each time we place a value A[j] into the B array; this 
causes the next input element with a value equal to A[j], if one exists, to 
go to the position immediately before A[j] in the output array. 
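
The three phases described above (count, accumulate, place) translate directly into a short program. The following Python sketch is one possible rendering, under assumptions of our own: Python lists are 0-based, so the output position C[A[j]] is shifted down by one, entry 0 of the count array is left unused so that keys 1 to k index it directly, and the input values are illustrative.

```python
def counting_sort(a, k):
    """Return a sorted copy of a, whose elements are integers in 1..k."""
    count = [0] * (k + 1)              # count[0] is unused
    for x in a:                        # lines 3-4: count[i] = number of elements equal to i
        count[x] += 1
    for i in range(2, k + 1):          # lines 6-7: count[i] = number of elements <= i
        count[i] += count[i - 1]
    out = [None] * len(a)
    for x in reversed(a):              # lines 9-11: scan the input from right to left
        out[count[x] - 1] = x          # 1-based position count[x], stored 0-based
        count[x] -= 1
    return out


example = counting_sort([3, 6, 4, 1, 3, 4, 1, 4], 6)
```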

How much time does counting sort require? The for loop of lines 1-2 
takes time O(k), the for loop of lines 3-4 takes time O(n), the for loop of 
lines 6-7 takes time O(k), and the for loop of lines 9-11 takes time O(n). 
Thus, the overall time is O(k + n). In practice, we usually use counting 
sort when we have k = O(n), in which case the running time is O(n). 

Counting sort beats the lower bound of Ω(n lg n) proved in Section 9.1 
because it is not a comparison sort. In fact, no comparisons between input 
elements occur anywhere in the code. Instead, counting sort uses the actual 
values of the elements to index into an array. The Ω(n lg n) lower bound 
for sorting does not apply when we depart from the comparison-sort model. 

An important property of counting sort is that it is stable: numbers with 
the same value appear in the output array in the same order as they do in 
the input array. That is, ties between two numbers are broken by the rule 
that whichever number appears first in the input array appears first in the 
output array. Of course, the property of stability is important only when 
satellite data are carried around with the element being sorted. We shall 
see why stability is important in the next section. 
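
Stability becomes visible once each key carries satellite data. The Python sketch below is an illustration under assumptions of our own: the records are (key, label) pairs, the parameter key is a function that extracts an integer in 1..k from a record, and positions are 0-based. It applies the same right-to-left placement as lines 9-11 of Counting-Sort.

```python
def counting_sort_records(records, k, key):
    """Stable counting sort of records whose key(record) is an integer in 1..k."""
    count = [0] * (k + 1)
    for rec in records:
        count[key(rec)] += 1
    for i in range(2, k + 1):
        count[i] += count[i - 1]
    out = [None] * len(records)
    for rec in reversed(records):      # right-to-left scan preserves input order of ties
        out[count[key(rec)] - 1] = rec
        count[key(rec)] -= 1
    return out


records = [(3, "a"), (1, "b"), (3, "c"), (2, "d"), (1, "e")]
ordered = counting_sort_records(records, 3, key=lambda rec: rec[0])
```

In the result, "b" still precedes "e" among the records with key 1, and "a" still precedes "c" among those with key 3.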

Exercises 

9.2-1 

Using Figure 9.2 as a model, illustrate the operation of Counting-Sort 
on the array A = ⟨7, 1, 3, 1, 2, 4, 5, 7, 2, 4, 3⟩. 

9.2-2 

Prove that Counting-Sort is stable. 

9.2-3 

Suppose that the for loop in line 9 of the Counting-Sort procedure is 
rewritten: 

9  for j ← 1 to length[A] 

Show that the algorithm still works properly. Is the modified algorithm 
stable? 

9.2-4 

Suppose that the output of the sorting algorithm is a data stream such as a 
graphics display. Modify Counting-Sort to produce the output in sorted 
order without using any substantial additional storage besides that in A 
and C. (Hint: Link elements of A that have the same key into lists. Where 
is a "free" place to keep the pointers for the linked list?) 

9.2-5 

Describe an algorithm that, given n integers in the range 1 to k, preprocesses 
its input and then answers any query about how many of the n 
integers fall into a range [a..b] in O(1) time. Your algorithm should use 
O(n + k) preprocessing time. 



9.3 Radix sort 

Radix sort is the algorithm used by the card-sorting machines you now find 
only in computer museums. The cards are organized into 80 columns, and 
in each column a hole can be punched in one of 12 places. The sorter can 
be mechanically "programmed" to examine a given column of each card 
in a deck and distribute the card into one of 12 bins depending on which 
place has been punched. An operator can then gather the cards bin by 
bin, so that cards with the first place punched are on top of cards with the 
second place punched, and so on. 

For decimal digits, only 10 places are used in each column. (The other 
two places are used for encoding nonnumeric characters.) A d-digit number 
would then occupy a field of d columns. Since the card sorter can look 
at only one column at a time, the problem of sorting n cards on a d-digit 
number requires a sorting algorithm. 

Intuitively, one might want to sort numbers on their most significant 
digit, sort each of the resulting bins recursively, and then combine the 
decks in order. Unfortunately, since the cards in 9 of the 10 bins must be 
put aside to sort each of the bins, this procedure generates many interme- 
diate piles of cards that must be kept track of. (See Exercise 9.3-5.) 

Radix-sort solves the problem of card sorting counterintuitively by sort- 
ing on the least significant digit first. The cards are then combined into a 
single deck, with the cards in the 0 bin preceding the cards in the 1 bin 
preceding the cards in the 2 bin, and so on. Then the entire deck is sorted 
again on the second least-significant digit and recombined in a like man- 
ner. The process continues until the cards have been sorted on all d digits. 
Remarkably, at that point the cards are fully sorted on the d-digit number. 
Thus, only d passes through the deck are required to sort. Figure 9.3 
shows how radix sort operates on a "deck" of seven 3-digit numbers. 

It is essential that the digit sorts in this algorithm be stable. The sort 
performed by a card sorter is stable, but the operator has to be wary about 
not changing the order of the cards as they come out of a bin, even though 
all the cards in a bin have the same digit in the chosen column. 

Figure 9.3 The operation of radix sort on a list of seven 3-digit numbers. The 
first column is the input. The remaining columns show the list after successive 
sorts on increasingly significant digit positions. The vertical arrows indicate the 
digit position sorted on to produce each list from the previous one. 

In a typical computer, which is a sequential random-access machine, 
radix sort is sometimes used to sort records of information that are keyed 
by multiple fields. For example, we might wish to sort dates by three keys: 
year, month, and day. We could run a sorting algorithm with a comparison 
function that, given two dates, compares years, and if there is a tie, 
compares months, and if another tie occurs, compares days. Alternatively, 
we could sort the information three times with a stable sort: first on day, 
next on month, and finally on year. 
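
The two approaches to sorting dates can be compared directly. The Python sketch below is an illustration with made-up dates stored as (year, month, day) tuples. It relies on the fact that Python's built-in sorted is a stable sort and that tuples compare field by field from left to right.

```python
from operator import itemgetter

# Illustrative dates as (year, month, day) tuples.
dates = [(1994, 3, 15), (1990, 12, 1), (1994, 1, 30), (1990, 12, 25), (1994, 3, 2)]

# Three stable passes: least significant field first.
by_passes = sorted(dates, key=itemgetter(2))       # day
by_passes = sorted(by_passes, key=itemgetter(1))   # month
by_passes = sorted(by_passes, key=itemgetter(0))   # year

# One pass with a comparison on year, then month, then day.
direct = sorted(dates)
```

Both orders agree. If any of the three passes were not stable, the ordering established by earlier passes could be destroyed.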

The code for radix sort is straightforward. The following procedure 
assumes that each element in the n-element array A has d digits, where 
digit 1 is the lowest-order digit and digit d is the highest-order digit. 

Radix-Sort(A, d) 
1  for j ← 1 to d 
2       do use a stable sort to sort array A on digit j 

The correctness of radix sort follows by induction on the column being 
sorted (see Exercise 9.3-3). The analysis of the running time depends on 
the stable sort used as the intermediate sorting algorithm. When each digit 
is in the range 1 to k, and k is not too large, counting sort is the obvious 
choice. Each pass over n d-digit numbers then takes time Θ(n + k). There 
are d passes, so the total time for radix sort is Θ(dn + kd). When d is 
constant and k = O(n), radix sort runs in linear time. 
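
As a concrete sketch of Radix-Sort with counting sort as the intermediate stable sort, consider the Python code below. It is an illustration under assumptions of our own: the keys are nonnegative integers, digits are extracted arithmetically and take values 0 to base - 1 (rather than 1 to k), and the helper name counting_sort_by_digit and the sample input are ours.

```python
def counting_sort_by_digit(a, exp, base):
    """Stable counting sort of a on the digit (x // exp) % base."""
    count = [0] * base
    for x in a:
        count[(x // exp) % base] += 1
    for d in range(1, base):
        count[d] += count[d - 1]       # count[d] = number of elements with digit <= d
    out = [0] * len(a)
    for x in reversed(a):              # right-to-left scan keeps the pass stable
        d = (x // exp) % base
        count[d] -= 1
        out[count[d]] = x
    return out


def radix_sort(a, d, base=10):
    """Sort nonnegative integers of at most d digits, least significant digit first."""
    exp = 1
    for _ in range(d):
        a = counting_sort_by_digit(a, exp, base)
        exp *= base
    return a


result = radix_sort([329, 457, 657, 839, 436, 720, 355], 3)
```

Each pass costs time proportional to n + base, and there are d passes, in line with the Θ(dn + kd) bound above.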

Some computer scientists like to think of the number of bits in a computer 
word as being Θ(lg n). For concreteness, let's say that d lg n is the 
number of bits, where d is a positive constant. Then, if each number to 
be sorted fits in one computer word, we can treat it as a d-digit number 
in radix-n notation. As a concrete example, consider sorting 1 million 
64-bit numbers. By treating these numbers as four-digit, radix-2^16 numbers, 
we can sort them in just four passes using radix sort. This compares 
favorably with a typical Θ(n lg n) comparison sort, which requires approximately 
lg n = 20 operations per number to be sorted. Unfortunately, the 
version of radix sort that uses counting sort as the intermediate stable sort 
does not sort in place, which many of the Θ(n lg n) comparison sorts do. 
Thus, when primary memory storage is at a premium, an algorithm such 
as quicksort may be preferable. 

Exercises 

9.3-1 

Using Figure 9.3 as a model, illustrate the operation of Radix-Sort on 
the following list of English words: COW, DOG, SEA, RUG, ROW, MOB, 
BOX, TAB, BAR, EAR, TAR, DIG, BIG, TEA, NOW, FOX. 

9.3-2 

Which of the following sorting algorithms are stable: insertion sort, merge 
sort, heapsort, and quicksort? Give a simple scheme that makes any sorting 
algorithm stable. How much additional time and space does your scheme 
entail? 

9.3-3 

Use induction to prove that radix sort works. Where does your proof need 
the assumption that the intermediate sort is stable? 

9.3-4 

Show how to sort n integers in the range 1 to n^2 in O(n) time. 

9.3-5 * 

In the first card-sorting algorithm in this section, exactly how many sorting 
passes are needed to sort d-digit decimal numbers in the worst case? How 
many piles of cards would an operator need to keep track of in the worst 
case? 



9.4 Bucket sort 

Bucket sort runs in linear time on the average. Like counting sort, bucket 
sort is fast because it assumes something about the input. Whereas count- 
ing sort assumes that the input consists of integers in a small range, bucket 
sort assumes that the input is generated by a random process that dis- 
tributes elements uniformly over the interval [0, 1). (See Section 6.2 for a 
definition of uniform distribution.) 

The idea of bucket sort is to divide the interval [0, 1) into n equal-sized 
subintervals, or buckets, and then distribute the n input numbers into the 
buckets. Since the inputs are uniformly distributed over [0,1), we don't 
expect many numbers to fall into each bucket. To produce the output, we 
simply sort the numbers in each bucket and then go through the buckets 
in order, listing the elements in each. 

Figure 9.4 The operation of Bucket-Sort. (a) The input array A[1..10]. (b) 
The array B[0..9] of sorted lists (buckets) after line 5 of the algorithm. Bucket i 
holds values in the interval [i/10, (i + 1)/10). The sorted output consists of a 
concatenation in order of the lists B[0], B[1], ..., B[9]. 

Our code for bucket sort assumes that the input is an n-element array A 
and that each element A[i] in the array satisfies 0 ≤ A[i] < 1. The code 
requires an auxiliary array B[0..n - 1] of linked lists (buckets) and assumes 
that there is a mechanism for maintaining such lists. (Section 11.2 
describes how to implement basic operations on linked lists.) 

Bucket-Sort(A) 
1  n ← length[A] 
2  for i ← 1 to n 
3       do insert A[i] into list B[⌊n A[i]⌋] 
4  for i ← 0 to n - 1 
5       do sort list B[i] with insertion sort 
6  concatenate the lists B[0], B[1], ..., B[n - 1] together in order 

Figure 9.4 shows the operation of bucket sort on an input array of 10 
numbers. 

To see that this algorithm works, consider two elements A[i] and A[j]. 
If these elements fall in the same bucket, they appear in the proper relative 
order in the output sequence because their bucket is sorted by insertion 
sort. Suppose they fall into different buckets, however. Let these buckets 
be B[i'] and B[j'], respectively, and assume without loss of generality that 
i' < j'. When the lists of B are concatenated in line 6, elements of bucket 
B[i'] come before elements of B[j'], and thus A[i] precedes A[j] in the 
output sequence. Hence, we must show that A[i] ≤ A[j]. Assuming the 
contrary, we have 

$$i' = \lfloor n A[i] \rfloor \ge \lfloor n A[j] \rfloor = j' ,$$

which is a contradiction, since i' < j'. Thus, bucket sort works. 
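
The procedure can be sketched in Python as follows. This is an illustration under assumptions of our own: Python lists stand in for the linked lists, the guard min(n - 1, ...) protects against floating-point rounding of n·x up to n, and the input values are illustrative.

```python
def insertion_sort(items):
    """Sort the list items in place."""
    for j in range(1, len(items)):
        key = items[j]
        i = j - 1
        while i >= 0 and items[i] > key:
            items[i + 1] = items[i]
            i -= 1
        items[i + 1] = key


def bucket_sort(a):
    """Return a sorted copy of a, whose elements satisfy 0 <= x < 1."""
    n = len(a)
    buckets = [[] for _ in range(n)]
    for x in a:
        index = min(n - 1, int(n * x))   # floor(n * x), guarded against rounding
        buckets[index].append(x)
    for b in buckets:
        insertion_sort(b)
    return [x for b in buckets for x in b]  # concatenate the buckets in order


data = [.78, .17, .39, .26, .72, .94, .21, .12, .23, .68]
result = bucket_sort(data)
```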

To analyze the running time, observe that all lines except line 5 take 
O(n) time in the worst case. The total time to examine all buckets in 
line 5 is O(n), and so the only interesting part of the analysis is the time 
taken by the insertion sorts in line 5. 

To analyze the cost of the insertion sorts, let n_i be the random variable 
denoting the number of elements placed in bucket B[i]. Since insertion 
sort runs in quadratic time (see Section 1.2), the expected time to sort the 
elements in bucket B[i] is E[O(n_i^2)] = O(E[n_i^2]). The total expected time 
to sort all the elements in all the buckets is therefore 

$$\sum_{i=0}^{n-1} O(\mathrm{E}[n_i^2]) = O\left( \sum_{i=0}^{n-1} \mathrm{E}[n_i^2] \right) . \qquad (9.1)$$

In order to evaluate this summation, we must determine the distribution 
of each random variable n_i. We have n elements and n buckets. The 
probability that a given element falls into bucket B[i] is 1/n, since each 
bucket is responsible for 1/n of the interval [0, 1). Thus, the situation is 
analogous to the ball-tossing example of Section 6.6.2: we have n balls 
(elements) and n bins (buckets), and each ball is thrown independently 
with probability p = 1/n of falling into any particular bucket. Thus, the 
probability that n_i = k follows the binomial distribution b(k; n, p), which 
has mean E[n_i] = np = 1 and variance Var[n_i] = np(1 - p) = 1 - 1/n. 
For any random variable X, equation (6.30) gives 

$$\mathrm{E}[n_i^2] = \mathrm{Var}[n_i] + \mathrm{E}^2[n_i] = 1 - \frac{1}{n} + 1^2 = 2 - \frac{1}{n} = \Theta(1) .$$

Using this bound in equation (9.1), we conclude that the expected time 
for insertion sorting is O(n). Thus, the entire bucket sort algorithm runs 
in linear expected time. 
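
The value of E[n_i^2] can also be checked by summing over the binomial distribution directly. The Python sketch below is a numerical illustration only; the function name second_moment_of_bucket_size is our own. It evaluates the sum of k^2 · b(k; n, 1/n) over k = 0, 1, ..., n, which should agree with 2 - 1/n up to floating-point error.

```python
from math import comb


def second_moment_of_bucket_size(n):
    """E[n_i^2] when n elements fall independently into n equally likely buckets."""
    p = 1 / n
    return sum(k * k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))


for n in (2, 10, 50):
    print(n, second_moment_of_bucket_size(n), 2 - 1 / n)
```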

Exercises 

9.4-1 

Using Figure 9.4 as a model, illustrate the operation of Bucket-Sort on 
the array A = (.79, .13, .16, .64, .39, .20, .89, .53, .71, .42). 



9.4-2 

What is the worst-case running time for the bucket-sort algorithm? What 
simple change to the algorithm preserves its linear expected running time 
and makes its worst-case running time O(n lg n)? 

9.4-3 * 

We are given n points in the unit circle, p_i = (x_i, y_i), such that 0 < 
x_i^2 + y_i^2 ≤ 1 for i = 1, 2, ..., n. Suppose that the points are uniformly 
distributed; that is, the probability of finding a point in any region of the 
circle is proportional to the area of that region. Design a Θ(n) expected-time 
algorithm to sort the n points by their distances d_i = sqrt(x_i^2 + y_i^2) from 
the origin. (Hint: Design the bucket sizes in Bucket-Sort to reflect the 
uniform distribution of the points in the unit circle.) 

9.4-4 * 

A probability distribution function P(x) for a random variable X is defined 
by P(x) = Pr{X ≤ x}. Suppose a list of n numbers has a continuous 
probability distribution function P that is computable in O(1) time. Show 
how to sort the numbers in linear expected time. 



Problems 

9-1 Average-case lower bounds on comparison sorting 

In this problem, we prove an Ω(n lg n) lower bound on the expected running 
time of any deterministic or randomized comparison sort on n inputs. 
We begin by examining a deterministic comparison sort A with decision 
tree T_A. We assume that every permutation of A's inputs is equally likely. 

a. Suppose that each leaf of T_A is labeled with the probability that it is 
reached given a random input. Prove that exactly n! leaves are labeled 
1/n! and that the rest are labeled 0. 

b. Let D(T) denote the external path length of a tree T; that is, D(T) is 
the sum of the depths of all the leaves of T. Let T be a tree with k > 1 
leaves, and let RT and LT be the right and left subtrees of T. Show 
that D(T) = D(RT) + D(LT) + k. 

c. Let d(m) be the minimum value of D(T) over all trees T with m leaves. 
Show that d(k) = min_{1 ≤ i ≤ k} {d(i) + d(k - i) + k}. (Hint: Consider a 
tree T with k leaves that achieves the minimum. Let i be the number 
of leaves in RT and k - i the number of leaves in LT.) 

d. Prove that for a given value of k, the function i lg i + (k - i) lg(k - i) is 
minimized at i = k/2. Conclude that d(k) = Ω(k lg k). 

e. Prove that D(T_A) = Ω(n! lg(n!)) for T_A, and conclude that the expected 
time to sort n elements is Ω(n lg n). 

Now, consider a randomized comparison sort B. We can extend the 
decision-tree model to handle randomization by incorporating two kinds 
of nodes: ordinary comparison nodes and "randomization" nodes. A ran- 

10 Medians and Order Statistics 



The ith order statistic of a set of n elements is the ith smallest element. 
For example, the minimum of a set of elements is the first order statistic 
(i = 1), and the maximum is the nth order statistic (i = n). A median, 
informally, is the "halfway point" of the set. When n is odd, the median 
is unique, occurring at i = (n + 1)/2. When n is even, there are two 
medians, occurring at i = n/2 and i = n/2 + 1. Thus, regardless of the 
parity of n, medians occur at i = ⌊(n + 1)/2⌋ and i = ⌈(n + 1)/2⌉. 

This chapter addresses the problem of selecting the ith order statistic 
from a set of n distinct numbers. We assume for convenience that the set 
contains distinct numbers, although virtually everything that we do extends 
to the situation in which a set contains repeated values. The selection 
problem can be specified formally as follows: 

Input: A set A of n (distinct) numbers and a number i, with 1 ≤ i ≤ n. 
Output: The element x ∈ A that is larger than exactly i - 1 other elements 
of A. 

The selection problem can be solved in O(nlgn) time, since we can sort 
the numbers using heapsort or merge sort and then simply index the ith 
element in the output array. There are faster algorithms, however. 

In Section 10.1, we examine the problem of selecting the minimum and 
maximum of a set of elements. More interesting is the general selection 
problem, which is investigated in the subsequent two sections. Section 10.2 
analyzes a practical algorithm that achieves an O(n) bound on the running 
time in the average case. Section 10.3 contains an algorithm of more 
theoretical interest that achieves the O(n) running time in the worst case. 



10.1 Minimum and maximum 

How many comparisons are necessary to determine the minimum of a set 
of n elements? We can easily obtain an upper bound of n - 1 comparisons: 
examine each element of the set in turn and keep track of the smallest 
element seen so far. In the following procedure, we assume that the set 
resides in array A, where length[A] = n. 

Minimum(A) 
1  min ← A[1] 
2  for i ← 2 to length[A] 
3       do if min > A[i] 
4             then min ← A[i] 
5  return min 

Finding the maximum can, of course, be accomplished with n - 1 com- 
parisons as well. 

Is this the best we can do? Yes, since we can obtain a lower bound of 
n - 1 comparisons for the problem of determining the minimum. Think 
of any algorithm that determines the minimum as a tournament among 
the elements. Each comparison is a match in the tournament in which the 
smaller of the two elements wins. The key observation is that every element 
except the winner must lose at least one match. Hence, n - 1 comparisons 
are necessary to determine the minimum, and the algorithm Minimum is 
optimal with respect to the number of comparisons performed. 

An interesting fine point of the analysis is the determination of the ex- 
pected number of times that line 4 is executed. Problem 6-2 asks you to 
show that this expectation is Θ(lg n). 

Simultaneous minimum and maximum 

In some applications, we must find both the minimum and the maximum 
of a set of n elements. For example, a graphics program may need to 
scale a set of (x,y) data to fit onto a rectangular display screen or other 
graphical output device. To do so, the program must first determine the 
minimum and maximum of each coordinate. 

It is not too difficult to devise an algorithm that can find both the min- 
imum and the maximum of n elements using the asymptotically optimal 
number of comparisons. Simply find the minimum and maximum 
independently, using n - 1 comparisons for each, for a total of 2n - 2 
comparisons. 

In fact, only 3⌈n/2⌉ comparisons are necessary to find both the minimum 
and the maximum. To do this, we maintain the minimum and 
maximum elements seen thus far. Rather than processing each element 
of the input by comparing it against the current minimum and maximum, 
however, at a cost of two comparisons per element, we process elements in 
pairs. We compare pairs of elements from the input first with each other, 
and then compare the smaller to the current minimum and the larger to the 
current maximum, at a cost of three comparisons for every two elements. 
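
The pairing strategy can be sketched in Python as follows. This is an illustration under assumptions of our own: the function name min_and_max, the way the first one or two elements initialize the running minimum and maximum, and the explicit comparison counter are not part of the text.

```python
def min_and_max(values):
    """Return (minimum, maximum, comparisons), processing elements in pairs."""
    n = len(values)
    if n == 0:
        raise ValueError("min_and_max() needs at least one element")
    comparisons = 0
    if n % 2 == 1:
        lo = hi = values[0]            # odd n: first element starts both roles
        start = 1
    else:
        comparisons += 1               # even n: one comparison orders the first pair
        if values[0] < values[1]:
            lo, hi = values[0], values[1]
        else:
            lo, hi = values[1], values[0]
        start = 2
    for i in range(start, n, 2):
        a, b = values[i], values[i + 1]
        comparisons += 3               # pair, smaller vs. minimum, larger vs. maximum
        if a < b:
            small, large = a, b
        else:
            small, large = b, a
        if small < lo:
            lo = small
        if large > hi:
            hi = large
    return lo, hi, comparisons


lo, hi, used = min_and_max([3, 2, 9, 0, 7, 5, 4, 8, 6, 1])
```

With this initialization the count never exceeds 3⌈n/2⌉; for the ten elements above it uses 13 comparisons, compared with 18 for two independent scans.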

Exercises 

10.1-1 

Show that the second smallest of n elements can be found with n + ⌈lg n⌉ - 2 
comparisons in the worst case. (Hint: Also find the smallest element.) 

10.1-2 * 

Show that ⌈3n/2⌉ - 2 comparisons are necessary in the worst case to find 
both the maximum and minimum of n numbers. (Hint: Consider how 
many numbers are potentially either the maximum or minimum, and in- 
vestigate how a comparison affects these counts.) 



10.2 Selection in expected linear time 

The general selection problem appears more difficult than the simple problem 
of finding a minimum, yet, surprisingly, the asymptotic running time 
for both problems is the same: Θ(n). In this section, we present a 
divide-and-conquer algorithm for the selection problem. The algorithm 
Randomized-Select is modeled after the quicksort algorithm of Chapter 8. 
As in quicksort, the idea is to partition the input array recursively. But 
unlike quicksort, which recursively processes both sides of the partition, 
Randomized-Select only works on one side of the partition. This difference 
shows up in the analysis: whereas quicksort has an expected running 
time of Θ(n lg n), the expected time of Randomized-Select is Θ(n). 

Randomized-Select uses the procedure Randomized-Partition in- 
troduced in Section 8.3. Thus, like Randomized-Quicksort, it is a ran- 
domized algorithm, since its behavior is determined in part by the output 
of a random-number generator. The following code for Randomized- 
Select returns the ith smallest element of the array A[p . . r]. 

Randomized-Select(A, p, r, i) 
1  if p = r 
2     then return A[p] 
3  q ← Randomized-Partition(A, p, r) 
4  k ← q - p + 1 
5  if i ≤ k 
6     then return Randomized-Select(A, p, q, i) 
7     else return Randomized-Select(A, q + 1, r, i - k) 

After Randomized-Partition is executed in line 3 of the algorithm, the 
array A[p..r] is partitioned into two nonempty subarrays A[p..q] and 
A[q + 1..r] such that each element of A[p..q] is less than each element of 
A[q + 1..r]. Line 4 of the algorithm computes the number k of elements 
in the subarray A[p..q]. The algorithm now determines in which of the 
two subarrays A[p..q] and A[q + 1..r] the ith smallest element lies. If 
i ≤ k, then the desired element lies on the low side of the partition, and 
it is recursively selected from the subarray in line 6. If i > k, however, 
then the desired element lies on the high side of the partition. Since we 
already know k values that are smaller than the ith smallest element of 
A[p..r] (namely, the elements of A[p..q]), the desired element is the 
(i - k)th smallest element of A[q + 1..r], which is found recursively in 
line 7. 
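
The procedure can be sketched in Python as follows. This is an illustration under assumptions of our own: array indices p and r are 0-based while the rank i stays 1-based, and since the partitioning procedure of Section 8.3 is not reproduced in this chapter, the sketch supplies a partition that scans inward from both ends around the first element and returns an index q with p ≤ q < r, so that both sides are nonempty.

```python
import random


def partition(a, p, r):
    """Rearrange a[p..r] around the pivot a[p] and return q with p <= q < r,
    such that every element of a[p..q] is <= every element of a[q+1..r]."""
    x = a[p]
    i, j = p - 1, r + 1
    while True:
        j -= 1
        while a[j] > x:
            j -= 1
        i += 1
        while a[i] < x:
            i += 1
        if i < j:
            a[i], a[j] = a[j], a[i]
        else:
            return j


def randomized_partition(a, p, r):
    """Partition a[p..r] around a pivot chosen uniformly at random."""
    k = random.randint(p, r)
    a[p], a[k] = a[k], a[p]
    return partition(a, p, r)


def randomized_select(a, p, r, i):
    """Return the i-th smallest (1-based) element of a[p..r]; rearranges a."""
    if p == r:
        return a[p]
    q = randomized_partition(a, p, r)
    k = q - p + 1                      # number of elements on the low side
    if i <= k:
        return randomized_select(a, p, q, i)
    return randomized_select(a, q + 1, r, i - k)


data = [3, 2, 9, 0, 7, 5, 4, 8, 6, 1]
median = randomized_select(list(data), 0, len(data) - 1, 5)
```

Only one side of each partition is examined further, which is the source of the difference from quicksort noted above.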

The worst-case running time for Randomized-Select is Θ(n^2), even 
to find the minimum, because we could be extremely unlucky and always 
partition around the largest remaining element. The algorithm works well 
in the average case, though, and because it is randomized, no particular 
input elicits the worst-case behavior. 

We can obtain an upper bound T(n) on the expected time required by 
Randomized-Select on an input array of n elements as follows. We 
observed in Section 8.4 that the algorithm Randomized-Partition produces 
a partition whose low side has 1 element with probability 2/n and i 
elements with probability 1/n for i = 2, 3, ..., n - 1. Assuming that T(n) 
is monotonically increasing, in the worst case Randomized-Select is always 
unlucky in that the ith element is determined to be on the larger side 
of the partition. Thus, we get the recurrence 

$$
\begin{aligned}
T(n) &\le \frac{1}{n} \left( T(\max(1, n-1)) + \sum_{k=1}^{n-1} T(\max(k, n-k)) \right) + O(n) \\
     &\le \frac{1}{n} \left( T(n-1) + 2 \sum_{k=\lceil n/2 \rceil}^{n-1} T(k) \right) + O(n) \\
     &= \frac{2}{n} \sum_{k=\lceil n/2 \rceil}^{n-1} T(k) + O(n) .
\end{aligned}
$$

The second line follows from the first since max(1, n - 1) = n - 1 and 

$$
\max(k, n-k) =
\begin{cases}
k & \text{if } k \ge \lceil n/2 \rceil , \\
n-k & \text{if } k < \lceil n/2 \rceil .
\end{cases}
$$

If n is odd, each term T(⌈n/2⌉), T(⌈n/2⌉ + 1), ..., T(n - 1) appears twice 
in the summation, and if n is even, each term T(⌈n/2⌉ + 1), T(⌈n/2⌉ + 2), 
..., T(n - 1) appears twice and the term T(⌈n/2⌉) appears once. In either 
case, the summation of the first line is bounded from above by the 
summation of the second line. The third line follows from the second since 
in the worst case T(n - 1) = O(n^2), and thus the term (1/n) T(n - 1) can be 
absorbed by the term O(n). 

We solve the recurrence by substitution. Assume that T(n) ≤ cn for 
some constant c that satisfies the initial conditions of the recurrence. Using 
this inductive hypothesis, we have 

$$
\begin{aligned}
T(n) &\le \frac{2}{n} \sum_{k=\lceil n/2 \rceil}^{n-1} ck + O(n) \\
     &= \frac{2c}{n} \left( \sum_{k=1}^{n-1} k - \sum_{k=1}^{\lceil n/2 \rceil - 1} k \right) + O(n) \\
     &= \frac{2c}{n} \left( \frac{1}{2}(n-1)n - \frac{1}{2} \left( \left\lceil \frac{n}{2} \right\rceil - 1 \right) \left\lceil \frac{n}{2} \right\rceil \right) + O(n) \\
     &\le c(n-1) - \frac{c}{n} \left( \frac{n}{2} - 1 \right) \frac{n}{2} + O(n) \\
     &= c \left( \frac{3}{4} n - \frac{1}{2} \right) + O(n) \\
     &\le cn ,
\end{aligned}
$$

since we can pick c large enough so that c(n/4 + 1/2) dominates the O(n) 
term. 

Thus, any order statistic, and in particular the median, can be deter- 
mined on average in linear time. 
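
The substitution argument can be sanity-checked numerically. The Python sketch below is an illustration under assumptions of our own: the O(n) term is taken to be exactly n, the base case is T(1) = 1, and the recurrence is treated as an equality. With these choices the argument above gives c = 4, since c(n/4 + 1/2) ≥ n when c = 4.

```python
def recurrence_values(limit):
    """Return t with t[n] = (2/n) * sum(t[k] for k in ceil(n/2)..n-1) + n
    for 2 <= n <= limit, and t[1] = 1. Entry t[0] is a placeholder."""
    t = [0.0, 1.0]
    prefix = [0.0, 1.0]                # prefix[m] = t[1] + ... + t[m]
    for n in range(2, limit + 1):
        half = (n + 1) // 2            # ceil(n / 2)
        tail = prefix[n - 1] - prefix[half - 1]
        t.append(2.0 / n * tail + n)
        prefix.append(prefix[-1] + t[-1])
    return t


values = recurrence_values(2000)
ratios = [values[n] / n for n in range(1, 2001)]
```

Every ratio T(n)/n stays at or below 4 in this experiment, consistent with the bound T(n) ≤ cn.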

Exercises 

10.2-1 

Write an iterative version of Randomized-Select. 

10.2-2 

Suppose we use Randomized-Select to select the minimum element of 
the array A = ⟨3, 2, 9, 0, 7, 5, 4, 8, 6, 1⟩. Describe a sequence of partitions 
that results in a worst-case performance of Randomized-Select. 

10.2-3 

Recall that in the presence of equal elements, the Randomized-Partition 
procedure partitions the subarray A[p..r] into two nonempty subarrays 
A[p..q] and A[q + 1..r] such that each element in A[p..q] is less than or 
equal to every element in A[q + 1..r]. If equal elements are present, does 
the Randomized-Select procedure work correctly? 



10.3 Selection in worst-case linear time 

We now examine a selection algorithm whose running time is O(n) in the 
worst case. Like Randomized-Select, the algorithm Select finds the desired 
element by recursively partitioning the input array. The idea behind 
the algorithm, however, is to guarantee a good split when the array is 
partitioned. Select uses the deterministic partitioning algorithm Partition 
from quicksort (see Section 8.1), modified to take the element to partition 
around as an input parameter. 

Figure 10.1 Analysis of the algorithm Select. The n elements are represented 
by small circles, and each group occupies a column. The medians of the groups 
are whitened, and the median-of-medians x is labeled. Arrows are drawn from 
larger elements to smaller, from which it can be seen that 3 out of every group 
of 5 elements to the right of x are greater than x, and 3 out of every group of 5 
elements to the left of x are less than x. The elements greater than x are shown 
on a shaded background. 

The Select algorithm determines the ith smallest of an input array of
n elements by executing the following steps.

1. Divide the n elements of the input array into $\lfloor n/5 \rfloor$ groups of 5 elements
each and at most one group made up of the remaining n mod 5 elements.

2. Find the median of each of the $\lceil n/5 \rceil$ groups by insertion sorting the
elements of each group (of which there are 5 at most) and taking its
middle element. (If the group has an even number of elements, take the
larger of the two medians.)

3. Use Select recursively to find the median x of the $\lceil n/5 \rceil$ medians found
in step 2.

4. Partition the input array around the median-of-medians x using a modified
version of Partition. Let k be the number of elements on the
low side of the partition, so that n - k is the number of elements on the
high side.

5. Use Select recursively to find the ith smallest element on the low side
if $i \le k$, or the (i - k)th smallest element on the high side if i > k.
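
The five steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the book's procedure: it partitions into three lists (less than, equal to, and greater than x) instead of using the modified in-place Partition, it uses the built-in sorted in place of insertion sort on each group, and the function name select and its 1-based rank argument i are our own choices.

```python
def select(a, i):
    """Return the i-th smallest element of a (i is 1-based).

    Sketch of median-of-medians selection; works on a copy of the input.
    """
    a = list(a)
    if not 1 <= i <= len(a):
        raise ValueError("i must satisfy 1 <= i <= len(a)")
    while True:
        n = len(a)
        if n <= 5:
            # Small inputs are handled directly.
            return sorted(a)[i - 1]
        # Steps 1-2: groups of 5 (plus one smaller group) and their medians.
        # For an even-sized group, index len(g) // 2 is the larger median.
        groups = [sorted(a[j:j + 5]) for j in range(0, n, 5)]
        medians = [g[len(g) // 2] for g in groups]
        # Step 3: recursively find the median of the medians.
        x = select(medians, (len(medians) + 1) // 2)
        # Step 4: partition around x (three-way, an assumption of this sketch).
        low = [v for v in a if v < x]
        equal = [v for v in a if v == x]
        high = [v for v in a if v > x]
        # Step 5: continue on the side that contains the answer.
        if i <= len(low):
            a = low
        elif i <= len(low) + len(equal):
            return x
        else:
            i -= len(low) + len(equal)
            a = high
```

For example, select([3, 2, 9, 0, 7, 5, 4, 8, 6, 1], 1) returns the minimum of that array. The three-way split keeps the sketch correct when equal elements are present, at the cost of extra memory compared with an in-place partition.
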

To analyze the running time of Select, we first determine a lower bound
on the number of elements that are greater than the partitioning element x.
Figure 10.1 is helpful in visualizing this bookkeeping. At least half of
the medians found in step 2 are greater than or equal to the
median-of-medians x. Thus, at least half of the $\lceil n/5 \rceil$ groups contribute 3 elements
that are greater than x, except for the one group that has fewer than 5
elements if 5 does not divide n exactly, and the one group containing x
itself. Discounting these two groups, it follows that the number of elements
greater than x is at least

$$3\left(\left\lceil \frac{1}{2}\left\lceil \frac{n}{5} \right\rceil\right\rceil - 2\right) \ge \frac{3n}{10} - 6 .$$

Similarly, the number of elements that are less than x is at least 3n/10 - 6.
Thus, in the worst case, Select is called recursively on at most 7n/10 + 6
elements in step 5.
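
The counting argument can be checked numerically. The sketch below assumes the bound takes the form 3(ceil(ceil(n/5)/2) - 2), which is what the argument yields when half of the groups each contribute 3 elements and two groups are discounted; the helper names are ours. All comparisons are done in integer arithmetic to avoid rounding.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers, using integer arithmetic only."""
    return -(-a // b)

def min_elements_greater(n):
    """Counting bound: 3 * (ceil(ceil(n/5) / 2) - 2) elements exceed x."""
    return 3 * (ceil_div(ceil_div(n, 5), 2) - 2)

def max_recursive_size(n):
    """Largest side step 5 can recurse on, given the counting bound."""
    return n - min_elements_greater(n)

def bounds_hold(limit):
    """Check the inequalities used in the analysis for n = 1, ..., limit."""
    for n in range(1, limit + 1):
        # 3n/10 - 6 <= counting bound, compared after multiplying by 10.
        if 10 * min_elements_greater(n) < 3 * n - 60:
            return False
        # Recursive call size <= 7n/10 + 6.
        if 10 * max_recursive_size(n) > 7 * n + 60:
            return False
        # 7n/10 + 6 < n holds exactly when n > 20.
        if (7 * n + 60 < 10 * n) != (n > 20):
            return False
    return True
```

For small n the counting bound is negative and therefore vacuous; it only becomes informative once n is large enough for the two discounted groups to matter little.
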

We can now develop a recurrence for the worst-case running time T(n)
of the algorithm Select. Steps 1, 2, and 4 take O(n) time. (Step 2 consists
of O(n) calls of insertion sort on sets of size O(1).) Step 3 takes time
$T(\lceil n/5 \rceil)$, and step 5 takes time at most T(7n/10 + 6), assuming that T is
monotonically increasing. Note that 7n/10 + 6 < n for n > 20 and that
any input of 80 or fewer elements requires O(1) time. We can therefore
obtain the recurrence

$$T(n) \le \begin{cases} \Theta(1) & \text{if } n \le 80 , \\ T(\lceil n/5 \rceil) + T(7n/10 + 6) + O(n) & \text{if } n > 80 . \end{cases}$$

We show that the running time is linear by substitution. Assume that
$T(n) \le cn$ for some constant c and all $n \le 80$. Substituting this inductive
hypothesis into the right-hand side of the recurrence yields

$$\begin{aligned} T(n) &\le c \lceil n/5 \rceil + c(7n/10 + 6) + O(n) \\ &\le cn/5 + c + 7cn/10 + 6c + O(n) \\ &\le 9cn/10 + 7c + O(n) \\ &\le cn , \end{aligned}$$

since we can pick c large enough so that c(n/10 - 7) is larger than the
function described by the O(n) term for all n > 80. The worst-case running
time of Select is therefore linear.

As in a comparison sort (see Section 9.1), Select and Randomized-Select
determine information about the relative order of elements only
by comparing elements. Thus, the linear-time behavior is not a result of
assumptions about the input, as was the case for the sorting algorithms in
Chapter 9. Sorting requires $\Omega(n \lg n)$ time in the comparison model, even
on average (see Problem 9-1), and thus the method of sorting and indexing
presented in the introduction to this chapter is asymptotically inefficient.



Exercises

10.3-1

In the algorithm Select, the input elements are divided into groups of 5.
Will the algorithm work in linear time if they are divided into groups of 7?
How about groups of 3?

10.3-2

Analyze Select to show that the number of elements greater than the
median-of-medians x and the number of elements less than x is at least
$\lceil n/4 \rceil$ if $n \ge 38$.

10.3-3 

Show how quicksort can be made to run in O(n lg n) time in the worst
case.

10.3-4 * 

Suppose that an algorithm uses only comparisons to find the ith smallest 
element in a set of n elements. Show that it can also find the i - 1 smaller 
elements and the n - i larger elements without performing any additional 
comparisons. 

10.3-5 

Given a "black-box" worst-case linear-time median subroutine, give a sim- 
ple, linear-time algorithm that solves the selection problem for an arbitrary 
order statistic. 

10.3-6 

The kth quantiles of an n-element set are the k - 1 order statistics that
divide the sorted set into k equal-sized sets (to within 1). Give an
O(n lg k)-time algorithm to list the kth quantiles of a set.

10.3-7 

Describe an O(n)-time algorithm that, given a set S of n distinct numbers
and a positive integer $k \le n$, determines the k numbers in S that are
closest to the median of S.

10.3-8 

Let X[1 . . n] and Y[1 . . n] be two arrays, each containing n numbers
already in sorted order. Give an O(lg n)-time algorithm to find the median
of all 2n elements in arrays X and Y.

10.3-9 

Professor Olay is consulting for an oil company, which is planning a large 
pipeline running east to west through an oil field of n wells. From each 
well, a spur pipeline is to be connected directly to the main pipeline along 
a shortest path (either north or south), as shown in Figure 10.2. Given x-
and y-coordinates of the wells, how should the professor pick the optimal
location of the main pipeline (the one that minimizes the total length of
the spurs)? Show that the optimal location can be determined in linear
time.

Figure 10.2 We want to determine the position of the east-west oil pipeline that
minimizes the total length of the north-south spurs.



Problems 

10-1 Largest i numbers in sorted order 

Given a set of n numbers, we wish to find the i largest in sorted order
using a comparison-based algorithm. Find the algorithm that implements
each of the following methods with the best asymptotic worst-case running
time, and analyze the running times of the methods in terms of n and i.

a. Sort the numbers and list the i largest.

b. Build a priority queue from the numbers and call Extract-Max i
times.

c. Use an order-statistic algorithm to find the ith largest number, partition,
and sort the i largest numbers.

10-2 Weighted median 

For n distinct elements $x_1, x_2, \ldots, x_n$ with positive weights $w_1, w_2, \ldots, w_n$
such that $\sum_{i=1}^{n} w_i = 1$, the weighted median is the element $x_k$ satisfying

$$\sum_{x_i < x_k} w_i \le \frac{1}{2}$$

and

$$\sum_{x_i > x_k} w_i \le \frac{1}{2} .$$

a. Argue that the median of $x_1, x_2, \ldots, x_n$ is the weighted median of the $x_i$
with weights $w_i = 1/n$ for $i = 1, 2, \ldots, n$.

b. Show how to compute the weighted median of n elements in O(n lg n)
worst-case time using sorting.

c. Show how to compute the weighted median in $\Theta(n)$ worst-case time
using a linear-time median algorithm such as Select from Section 10.3.

The post-office location problem is defined as follows. We are given n
points $p_1, p_2, \ldots, p_n$ with associated weights $w_1, w_2, \ldots, w_n$. We wish to
find a point p (not necessarily one of the input points) that minimizes the
sum $\sum_{i=1}^{n} w_i \, d(p, p_i)$, where d(a, b) is the distance between points a and b.

d. Argue that the weighted median is a best solution for the 1-dimensional
post-office location problem, in which points are simply real numbers
and the distance between points a and b is $d(a, b) = |a - b|$.

e. Find the best solution for the 2-dimensional post-office location problem,
in which the points are (x, y) coordinate pairs and the distance
between points $a = (x_1, y_1)$ and $b = (x_2, y_2)$ is the Manhattan distance:

$$d(a, b) = |x_1 - x_2| + |y_1 - y_2| .$$

10-3 Small order statistics 

The worst-case number T(n) of comparisons used by Select to select the
ith order statistic from n numbers was shown to satisfy $T(n) = \Theta(n)$, but
the constant hidden by the $\Theta$-notation is rather large. When i is small
relative to n, we can implement a different procedure that uses Select as
a subroutine but makes fewer comparisons in the worst case.

a. Describe an algorithm that uses $U_i(n)$ comparisons to find the ith smallest
of n elements, where $i \le n/2$ and



(Hint: Begin with $\lfloor n/2 \rfloor$ disjoint pairwise comparisons, and recurse on
the set containing the smaller element from each pair.)

b. Show that $U_i(n) = n + O(T(2i) \lg(n/i))$.

c. Show that if i is a constant, then $U_i(n) = n + O(\lg n)$.

d. Show that if i = n/k for $k \ge 2$, then $U_i(n) = n + O(T(2n/k) \lg k)$.



The worst-case median-finding algorithm was invented by Blum, Floyd,
Pratt, Rivest, and Tarjan [29]. The fast average-time version is due to
Hoare [97]. Floyd and Rivest [70] have developed an improved average-time
version that partitions around an element recursively selected from a
small sample of the elements.



Clustering is the unsupervised classification of patterns (observations, data items,
or feature vectors) into groups (clusters). The clustering problem has been 
addressed in many contexts and by researchers in many disciplines; this reflects its
broad appeal and usefulness as one of the steps in exploratory data analysis.
However, clustering is a difficult problem combinatorially, and differences in
assumptions and contexts in different communities have made the transfer of useful
generic concepts and methodologies slow to occur. This paper presents an overview 
of pattern clustering methods from a statistical pattern recognition perspective, 
with a goal of providing useful advice and references to fundamental concepts 
accessible to the broad community of clustering practitioners. We present a 
taxonomy of clustering techniques, and identify cross-cutting themes and recent 
advances. We also describe some important applications of clustering algorithms 
such as image segmentation, object recognition, and information retrieval. 

Categories and Subject Descriptors: I.5.1 [Pattern Recognition]: Models; I.5.3
[Pattern Recognition]: Clustering; I.5.4 [Pattern Recognition]: Applications -
Computer vision; H.3.3 [Information Storage and Retrieval]: Information
Search and Retrieval - Clustering; I.2.6 [Artificial Intelligence]:
Learning - Knowledge acquisition
General Terms: Algorithms 

Additional Key Words and Phrases: Cluster analysis, clustering applications, 
exploratory data analysis, incremental clustering, similarity indices, unsupervised 
learning 



Section 6.1 is based on the chapter "Image Segmentation Using Clustering" by A.K. Jain and P.J. 
Flynn, Advances in Image Understanding: A Festschrift for Azriel Rosenfeld (K. Bowyer and N. Ahuja, 
Eds.), 1996 IEEE Computer Society Press, and is used by permission of the IEEE Computer Society. 
Authors' addresses: A. Jain, Department of Computer Science, Michigan State University, A714 Wells 
Hall, East Lansing, MI 48824; M. Murty, Department of Computer Science and Automation, Indian
Institute of Science, Bangalore, 560 012, India; P. Flynn, Department of Electrical Engineering, The 
Ohio State University, Columbus, OH 43210. 

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted 
without fee provided that the copies are not made or distributed for profit or commercial advantage, the 
copyright notice, the title of the publication, and its date appear, and notice is given that copying is by 
permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to 
lists, requires prior specific permission and/or a fee. 
© 2000 ACM 0360-0300/99/0900-0001 $5.00 



ACM Computing Surveys, Vol. 31, No. 3, September 1999 






CONTENTS 

1. Introduction
   1.1 Motivation
   1.2 Components of a Clustering Task
   1.3 The User's Dilemma and the Role of Expertise
   1.4 History
   1.5 Outline
2. Definitions and Notation
3. Pattern Representation, Feature Selection and Extraction
4. Similarity Measures
5. Clustering Techniques
   5.1 Hierarchical Clustering Algorithms
   5.2 Partitional Algorithms
   5.3 Mixture-Resolving and Mode-Seeking Algorithms
   5.4 Nearest Neighbor Clustering
   5.5 Fuzzy Clustering
   5.6 Representation of Clusters
   5.7 Artificial Neural Networks for Clustering
   5.8 Evolutionary Approaches for Clustering
   5.9 Search-Based Approaches
   5.10 A Comparison of Techniques
   5.11 Incorporating Domain Constraints in Clustering
   5.12 Clustering Large Data Sets
6. Applications
   6.1 Image Segmentation Using Clustering
   6.2 Object and Character Recognition
   6.3 Information Retrieval
   6.4 Data Mining
7. Summary



1. INTRODUCTION 
1.1 Motivation 

Data analysis underlies many comput- 
ing applications, either in a design 
phase or as part of their on-line
operations. Data analysis procedures can be
dichotomized as either exploratory or 
confirmatory, based on the availability 
of appropriate models for the data 
source, but a key element in both types 
of procedures (whether for hypothesis 
formation or decision-making) is the 
grouping, or classification of measure- 
ments based on either (i) goodness-of-fit 
to a postulated model, or (ii) natural 
groupings (clustering) revealed through 
analysis. Cluster analysis is the organi- 
zation of a collection of patterns (usual- 
ly represented as a vector of measure- 
ments, or a point in a multidimensional 
space) into clusters based on similarity. 



Intuitively, patterns within a valid clus- 
ter are more similar to each other than 
they are to a pattern belonging to a 
different cluster. An example of cluster- 
ing is depicted in Figure 1. The input 
patterns are shown in Figure 1(a), and 
the desired clusters are shown in Figure 
1(b). Here, points belonging to the same 
cluster are given the same label. The 
variety of techniques for representing 
data, measuring proximity (similarity) 
between data elements, and grouping 
data elements has produced a rich and 
often confusing assortment of clustering 
methods. 

It is important to understand the dif- 
ference between clustering (unsuper- 
vised classification) and discriminant 
analysis (supervised classification). In 
supervised classification, we are pro- 
vided with a collection of labeled (pre- 
classified) patterns; the problem is to 
label a newly encountered, yet
unlabeled, pattern. Typically, the given
labeled (training) patterns are used to
learn the descriptions of classes which 
in turn are used to label a new pattern. 
In the case of clustering, the problem is 
to group a given collection of unlabeled 
patterns into meaningful clusters. In a 
sense, labels are associated with
clusters also, but these category labels are
data driven; that is, they are obtained
solely from the data. 

Clustering is useful in several explor- 
atory pattern-analysis, grouping, deci- 
sion-making, and machine-learning sit- 
uations, including data mining, 
document retrieval, image segmenta- 
tion, and pattern classification. How- 
ever, in many such problems, there is 
little prior information (e.g., statistical 
models) available about the data, and 
the decision-maker must make as few 
assumptions about the data as possible. 
It is under these restrictions that clus- 
tering methodology is particularly ap- 
propriate for the exploration of interre- 
lationships among the data points to 
make an assessment (perhaps prelimi- 
nary) of their structure. 

Figure 1. Data clustering.

The term "clustering" is used in several
research communities to describe
methods for grouping of unlabeled data.
These communities have different ter- 
minologies and assumptions for the 
components of the clustering process 
and the contexts in which clustering is 
used. Thus, we face a dilemma regard- 
ing the scope of this survey. The produc- 
tion of a truly comprehensive survey 
would be a monumental task given the 
sheer mass of literature in this area. 
The accessibility of the survey might 
also be questionable given the need to
reconcile very different vocabularies 
and assumptions regarding clustering 
in the various communities. 

The goal of this paper is to survey the 
core concepts and techniques in the 
large subset of cluster analysis with its 
roots in statistics and decision theory. 
Where appropriate, references will be 
made to key concepts and techniques 
arising from clustering methodology in 
the machine-learning and other commu- 
nities. 

The audience for this paper includes 
practitioners in the pattern recognition 
and image analysis communities (who 
should view it as a summarization of 
current practice), practitioners in the 
machine-learning communities (who 
should view it as a snapshot of a closely 
related field with a rich history of well- 
understood techniques), and the 
broader audience of scientific professionals
(who should view it as an accessible
introduction to a mature field that
is making important contributions to 
computing application areas). 

1.2 Components of a Clustering Task 

Typical pattern clustering activity in- 
volves the following steps [Jain and 
Dubes 1988]: 

(1) pattern representation (optionally 
including feature extraction and/or 
selection), 

(2) definition of a pattern proximity 
measure appropriate to the data do- 
main, 

(3) clustering or grouping, 

(4) data abstraction (if needed), and 

(5) assessment of output (if needed).

Figure 2 depicts a typical sequencing of
the first three of these steps, including 
a feedback path where the grouping 
process output could affect subsequent 
feature extraction and similarity com- 
putations. 

Pattern representation refers to the 
number of classes, the number of avail- 
able patterns, and the number, type, 
and scale of the features available to the 
clustering algorithm. Some of this
information may not be controllable by the
practitioner. Feature selection is the
process of identifying the most effective 
subset of the original features to use in 
clustering. Feature extraction is the use 
of one or more transformations of the 
input features to produce new salient 
features. Either or both of these tech- 
niques can be used to obtain an appro- 
priate set of features to use in cluster- 
ing. 
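
The distinction between the two techniques can be illustrated with a small Python sketch. The three raw features (width, height, weight) and the two derived features (area, aspect ratio) are hypothetical and chosen only for illustration; the survey does not prescribe them.

```python
def select_features(pattern, keep):
    """Feature selection: keep a subset of the original features unchanged."""
    return tuple(pattern[j] for j in keep)

def extract_features(pattern):
    """Feature extraction: compute new features from the original ones.

    Assumes a hypothetical pattern layout (width, height, weight).
    """
    width, height, _weight = pattern
    return (width * height, width / height)  # (area, aspect ratio)
```

Selection returns values that already appear in the pattern, whereas extraction produces values that appear nowhere in the original measurements. For instance, select_features((2.0, 1.0, 10.0), (0, 2)) keeps the first and third features as they are.
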

Pattern proximity is usually measured 
by a distance function defined on pairs 
of patterns. A variety of distance mea- 
sures are in use in the various commu- 
nities [Anderberg 1973; Jain and Dubes 
1988; Diday and Simon 1976]. A simple 
distance measure like Euclidean dis- 
tance can often be used to reflect dis- 
similarity between two patterns, 
whereas other similarity measures can 
be used to characterize the conceptual 
similarity between patterns [Michalski 
and Stepp 1983]. Distance measures are 
discussed in Section 4. 
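
As a minimal sketch of a distance function defined on pairs of patterns, the following Python code computes the Euclidean distance and a full proximity matrix for a small pattern set. The sample patterns and the names euclidean_distance and proximity_matrix are illustrative assumptions.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two patterns of equal dimensionality."""
    if len(x) != len(y):
        raise ValueError("patterns must have the same number of features")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def proximity_matrix(patterns):
    """Pairwise distances; entry [i][j] is the distance between patterns i and j."""
    return [[euclidean_distance(p, q) for q in patterns] for p in patterns]

# Hypothetical two-dimensional patterns.
patterns = [(0.0, 0.0), (3.0, 4.0), (0.5, 0.5)]
distances = proximity_matrix(patterns)
```

The resulting matrix is symmetric with a zero diagonal, which is what a grouping step that works from pairwise dissimilarities would typically take as input.
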

The grouping step can be performed 
in a number of ways. The output clus- 
tering (or clusterings) can be hard (a 
partition of the data into groups) or 
fuzzy (where each pattern has a vari- 
able degree of membership in each of 
the output clusters). Hierarchical clus- 
tering algorithms produce a nested se- 
ries of partitions based on a criterion for 
merging or splitting clusters based on 
similarity; Partitional clustering algo- 
rithms identify the partition that opti- 
mizes (usually locally) a clustering cri- 
terion. Additional techniques for the 
grouping operation include probabilistic 
[Brailovski 1991] and graph-theoretic 
[Zahn 1971] clustering methods. The 
variety of techniques for cluster forma- 
tion is described in Section 5. 
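
The difference between hard and fuzzy output can be made concrete with a short Python sketch. The membership values are invented for illustration, and the convention that each pattern's memberships sum to 1 is an assumption of this sketch (it is common in fuzzy clustering but is not stated above).

```python
# Fuzzy output: one row per pattern, one column per cluster.
# Each entry is a degree of membership (assumed here to sum to 1 per row).
fuzzy_membership = [
    [0.9, 0.1],
    [0.7, 0.3],
    [0.4, 0.6],
    [0.1, 0.9],
]

def harden(membership):
    """Turn a fuzzy clustering into a hard one by picking, for each pattern,
    the cluster in which its membership is largest."""
    return [max(range(len(row)), key=row.__getitem__) for row in membership]

# Hard output: a single cluster index per pattern.
hard_labels = harden(fuzzy_membership)
```

Hardening discards information: the third pattern, with memberships 0.4 and 0.6, receives the same kind of label as the fourth, whose assignment is far less ambiguous.
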



Data abstraction is the process of
extracting a simple and compact
representation of a data set. Here, simplicity is
either from the perspective of automatic 
analysis (so that a machine can perform 
further processing efficiently) or it is 
human-oriented (so that the representa- 
tion obtained is easy to comprehend and 
intuitively appealing). In the clustering 
context, a typical data abstraction is a 
compact description of each cluster, 
usually in terms of cluster prototypes or 
representative patterns such as the
centroid [Diday and Simon 1976].

How is the output of a clustering algo- 
rithm evaluated? What characterizes a 
'good' clustering result and a 'poor' one? 
All clustering algorithms will, when 
presented with data, produce clusters — 
regardless of whether the data contain 
clusters or not. If the data do contain
clusters, some clustering algorithms 
may obtain 'better' clusters than others. 
The assessment of a clustering proce- 
dure's output, then, has several facets. 
One is actually an assessment of the 
data domain rather than the clustering 
algorithm itself— data which do not 
contain clusters should not be processed 
by a clustering algorithm. The study of 
cluster tendency, wherein the input data 
are examined to see if there is any merit 
to a cluster analysis prior to one being 
performed, is a relatively inactive re- 
search area, and will not be considered 
further in this survey. The interested 
reader is referred to Dubes [1987] and 
Cheng [1995] for information. 

Cluster validity analysis, by contrast, 
is the assessment of a clustering proce- 
dure's output. Often this analysis uses a 
specific criterion of optimality; however, 
these criteria are usually arrived at 
subjectively. Hence, little in the way of
'gold standards' exists in clustering
except in well-prescribed subdomains.
Validity assessments are objective [Dubes
1993] and are performed to determine 
whether the output is meaningful. A 
clustering structure is valid if it cannot 
reasonably have occurred by chance or 
as an artifact of a clustering algorithm. 
When statistical approaches to cluster- 
ing are used, validation is accomplished 
by carefully applying statistical meth- 
ods and testing hypotheses. There are 
three types of validation studies. An 
external assessment of validity com- 
pares the recovered structure to an a 
priori structure. An internal examina- 
tion of validity tries to determine if the 
structure is intrinsically appropriate for 
the data. A relative test compares two 
structures and measures their relative 
merit. Indices used for this comparison 
are discussed in detail in Jain and 
Dubes [1988] and Dubes [1993], and are 
not discussed further in this paper. 

1.3 The User's Dilemma and the Role of 
Expertise 

The availability of such a vast collection 
of clustering algorithms in the litera- 
ture can easily confound a user attempt- 
ing to select an algorithm suitable for 
the problem at hand. In Dubes and Jain 
[1976], a set of admissibility criteria 
defined by Fisher and Van Ness [1971] 
are used to compare clustering algo- 
rithms. These admissibility criteria are 
based on: (1) the manner in which clus- 
ters are formed, (2) the structure of the 
data, and (3) sensitivity of the cluster- 
ing technique to changes that do not 
affect the structure of the data. How- 
ever, there is no critical analysis of clus- 
tering algorithms dealing with the im- 
portant questions such as 

—How should the data be normalized? 

—Which similarity measure is appropri- 
ate to use in a given situation? 

—How should domain knowledge be uti- 
lized in a particular clustering prob- 
lem? 



—How can a very large data set (say, a
million patterns) be clustered effi- 
ciently? 

These issues have motivated this sur- 
vey, and its aim is to provide a perspec- 
tive on the state of the art in clustering 
methodology and algorithms. With such 
a perspective, an informed practitioner 
should be able to confidently assess the 
tradeoffs of different techniques, and 
ultimately make a competent decision 
on a technique or suite of techniques to 
employ in a particular application. 

There is no clustering technique that 
is universally applicable in uncovering 
the variety of structures present in mul- 
tidimensional data sets. For example, 
consider the two-dimensional data set 
shown in Figure 1(a). Not all clustering 
techniques can uncover all the clusters 
present here with equal facility, because 
clustering algorithms often contain im- 
plicit assumptions about cluster shape 
or multiple-cluster configurations based 
on the similarity measures and group- 
ing criteria used. 

Humans perform competitively with 
automatic clustering procedures in two 
dimensions, but most real problems in- 
volve clustering in higher dimensions. It 
is difficult for humans to obtain an intu- 
itive interpretation of data embedded in 
a high-dimensional space. In addition, 
data hardly follow the "ideal" structures 
(e.g., hyperspherical, linear) shown in 
Figure 1. This explains the large num- 
ber of clustering algorithms which con- 
tinue to appear in the literature; each 
new clustering algorithm performs 
slightly better than the existing ones on 
a specific distribution of patterns. 

It is essential for the user of a cluster- 
ing algorithm to not only have a thor- 
ough understanding of the particular 
technique being utilized, but also to 
know the details of the data gathering 
process and to have some domain exper- 
tise; the more information the user has 
about the data at hand, the more likely 
the user would be able to succeed in 
assessing its true class structure [Jain 
and Dubes 1988]. This domain
information can also be used to improve the
quality of feature extraction, similarity 
computation, grouping, and cluster rep- 
resentation [Murty and Jain 1995]. 

Appropriate constraints on the data 
source can be incorporated into a clus- 
tering procedure. One example of this is 
mixture resolving [Titterington et al. 
1985], wherein it is assumed that the 
data are drawn from a mixture of an 
unknown number of densities (often as- 
sumed to be multivariate Gaussian). 
The clustering problem here is to iden- 
tify the number of mixture components 
and the parameters of each component. 
The concept of density clustering and a 
methodology for decomposition of fea- 
ture spaces [Bajcsy 1997] have also 
been incorporated into traditional clus- 
tering methodology, yielding a tech- 
nique for extracting overlapping clus- 
ters. 
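
As a hedged illustration of the mixture-resolving setting mentioned in this paragraph, a mixture of multivariate Gaussian densities can be written as follows. The symbols K (number of components), $\pi_j$ (mixing proportions), and $\mu_j$, $\Sigma_j$ (component means and covariance matrices) are our notation, not the survey's.

```latex
p(\mathbf{x}) = \sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x} \mid \mu_j, \Sigma_j),
\qquad \pi_j \ge 0, \quad \sum_{j=1}^{K} \pi_j = 1 .
```

In this notation, the clustering problem is to estimate K together with the parameters $\pi_j$, $\mu_j$, and $\Sigma_j$ of each component from the observed patterns.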

1.4 History 

Even though there is an increasing in- 
terest in the use of clustering methods 
in pattern recognition [Anderberg 
1973], image processing [Jain and 
Flynn 1996] and information retrieval 
[Rasmussen 1992; Salton 1991], cluster- 
ing has a rich history in other disci- 
plines [Jain and Dubes 1988] such as 
biology, psychiatry, psychology, archae- 
ology, geology, geography, and market- 
ing. Other terms more or less synony- 
mous with clustering include 
unsupervised learning [Jain and Dubes 
1988], numerical taxonomy [Sneath and 
Sokal 1973], vector quantization [Oehler 
and Gray 1995], and learning by obser- 
vation [Michalski and Stepp 1983]. The 
field of spatial analysis of point pat- 
terns [Ripley 1988] is also related to 
cluster analysis. The importance and 
interdisciplinary nature of clustering is 
evident through its vast literature. 

A number of books on clustering have 
been published [Jain and Dubes 1988; 
Anderberg 1973; Hartigan 1975; Spath 
1980; Duran and Odell 1974; Everitt 
1993; Backer 1995], in addition to some 
useful and influential review papers. A
survey of the state of the art in
clustering circa 1978 was reported in Dubes
and Jain [1980]. A comparison of vari- 
ous clustering algorithms for construct- 
ing the minimal spanning tree and the 
short spanning path was given in Lee 
[1981]. Cluster analysis was also sur- 
veyed in Jain et al. [1986]. A review of 
image segmentation by clustering was 
reported in Jain and Flynn [1996]. Com- 
parisons of various combinatorial opti- 
mization schemes, based on experi- 
ments, have been reported in Mishra 
and Raghavan [1994] and Al-Sultan and 
Khan [1996]. 



1.5 Outline 

This paper is organized as follows. Sec- 
tion 2 presents definitions of terms to be 
used throughout the paper. Section 3 
summarizes pattern representation, 
feature extraction, and feature selec- 
tion. Various approaches to the compu- 
tation of proximity between patterns 
are discussed in Section 4. Section 5 
presents a taxonomy of clustering ap- 
proaches, describes the major tech- 
niques in use, and discusses emerging 
techniques for clustering incorporating 
non-numeric constraints and the clus- 
tering of large sets of patterns. Section 
6 discusses applications of clustering 
methods to image analysis and data 
mining problems. Finally, Section 7 pre- 
sents some concluding remarks. 



2. DEFINITIONS AND NOTATION 

The following terms and notation are 
used throughout this paper. 

—A pattern (or feature vector, observation,
or datum) $\mathbf{x}$ is a single data item
used by the clustering algorithm. It
typically consists of a vector of d
measurements: $\mathbf{x} = (x_1, \ldots, x_d)$.

—The individual scalar components $x_i$
of a pattern $\mathbf{x}$ are called features (or
attributes).

—d is the dimensionality of the pattern
or of the pattern space.

—A pattern set is denoted $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$.
The ith pattern in $\mathcal{X}$ is
denoted $\mathbf{x}_i = (x_{i,1}, \ldots, x_{i,d})$. In many
cases a pattern set to be clustered is
viewed as an $n \times d$ pattern matrix.

—A class, in the abstract, refers to a
state of nature that governs the pattern
generation process in some cases.
More concretely, a class can be viewed
as a source of patterns whose distribution
in feature space is governed by
a probability density specific to the
class. Clustering techniques attempt
to group patterns so that the classes
thereby obtained reflect the different
pattern generation processes represented
in the pattern set.

—Hard clustering techniques assign a
class label $l_i$ to each pattern $\mathbf{x}_i$,
identifying its class. The set of all labels
for a pattern set $\mathcal{X}$ is
$\mathcal{L} = \{l_1, \ldots, l_n\}$, with $l_i \in \{1, \ldots, k\}$,
where k is the number of clusters.

—Fuzzy clustering procedures assign to
each input pattern $\mathbf{x}_i$ a fractional
degree of membership $f_{ij}$ in each output
cluster j.

—A distance measure (a specialization
of a proximity measure) is a metric
(or quasi-metric) on the feature space
used to quantify the similarity of
patterns.
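
The notation for a pattern set can be mirrored directly in code. The sketch below stores a hypothetical pattern set as a nested list playing the role of the n by d pattern matrix; the sample values, the helper name feature, and the label list are illustrative assumptions.

```python
# Pattern matrix: n rows (patterns), d columns (features).
pattern_matrix = [
    [1.0, 2.0],
    [1.5, 1.8],
    [8.0, 8.5],
]
n = len(pattern_matrix)     # number of patterns
d = len(pattern_matrix[0])  # dimensionality of each pattern

def feature(i, j):
    """Return feature j of pattern i, using 1-based indices as in the text."""
    return pattern_matrix[i - 1][j - 1]

# Hard clustering: one label per pattern, each drawn from {1, ..., k}.
k = 2
labels = [1, 1, 2]
```

Here feature(3, 2) reads the second feature of the third pattern, and the label list has exactly one entry per row of the matrix.
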

3. PATTERN REPRESENTATION, FEATURE 
SELECTION AND EXTRACTION 

There are no theoretical guidelines that 
suggest the appropriate patterns and 
features to use in a specific situation. 
Indeed, the pattern generation process 
is often not directly controllable; the 
user's role in the pattern representation 
process is to gather facts and conjec- 
tures about the data, optionally perform 
feature selection and extraction, and
design the subsequent elements of the
clustering system. Because of the
difficulties surrounding pattern
representation, it is conveniently assumed that the
pattern representation is available prior 
to clustering. Nonetheless, a careful in- 
vestigation of the available features and 
any available transformations (even 
simple ones) can yield significantly im- 
proved clustering results. A good pat- 
tern representation can often yield a 
simple and easily understood clustering; 
a poor pattern representation may yield 
a complex clustering whose true struc- 
ture is difficult or impossible to discern. 
Figure 3 shows a simple example. The 
points in this 2D feature space are ar- 
ranged in a curvilinear cluster of ap- 
proximately constant distance from the 
origin. If one chooses Cartesian coordi- 
nates to represent the patterns, many 
clustering algorithms would be likely to 
fragment the cluster into two or more 
clusters, since it is not compact. If, how- 
ever, one uses a polar coordinate repre- 
sentation for the clusters, the radius 
coordinate exhibits tight clustering and 
a one-cluster solution is likely to be 
easily obtained. 
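
The effect of the coordinate choice can be demonstrated with a small Python sketch. The synthetic points below lie on a half circle of radius close to 1 with a small deterministic perturbation; they are invented for illustration and are not the data of Figure 3.

```python
import math

def to_polar(x, y):
    """Convert Cartesian coordinates to (radius, angle in radians)."""
    return math.hypot(x, y), math.atan2(y, x)

def spread(values):
    """Range of a list of numbers, used here as a crude compactness measure."""
    return max(values) - min(values)

# Synthetic curvilinear cluster: angles from 0 to 180 degrees, radius near 1.
points = []
for t, degrees in enumerate(range(0, 181, 15)):
    radius = 1.0 + 0.02 * (-1) ** t
    angle = math.radians(degrees)
    points.append((radius * math.cos(angle), radius * math.sin(angle)))

xs = [x for x, _ in points]
radii = [to_polar(x, y)[0] for x, y in points]
```

In Cartesian coordinates the x values span a range of about 2, so the points do not look compact. After conversion, the radius coordinate varies by only about 0.04, which is the tight one-dimensional clustering described above.
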

A pattern can measure either a phys- 
ical object (e.g., a chair) or an abstract 
notion (e.g., a style of writing). As noted 
above, patterns are represented conven- 
tionally as multidimensional vectors, 
where each dimension is a single fea- 
ture [Duda and Hart 1973]. These fea- 
tures can be either quantitative or qual- 
itative. For example, if weight and color 
are the two features used, then 
(20, black) is the representation of a 
black object with 20 units of weight. 
The features can be subdivided into the 
following types [Gowda and Diday 
1992]: 

(1) Quantitative features: 

(a) continuous values (e.g., weight); 

(b) discrete values (e.g., the number 
of computers); 

(c) interval values (e.g., the dura- 
tion of an event). 

(2) Qualitative features: 

(a) nominal or unordered (e.g., color); 

(b) ordinal (e.g., military rank or 
qualitative evaluations of temperature 
("cool" or "hot") or 
sound intensity ("quiet" or 
"loud")). 

Figure 3. A curvilinear cluster whose points 
are approximately equidistant from the origin. 
Different pattern representations (coordinate 
systems) would cause clustering algorithms to 
yield different results for this data (see text). 

Quantitative features can be measured 
on a ratio scale (with a meaningful ref- 
erence value, such as temperature), or 
on nominal or ordinal scales. 

One can also use structured features 
[Michalski and Stepp 1983] which are 
represented as trees, where the parent 
node represents a generalization of its 
child nodes. For example, a parent node 
"vehicle" may be a generalization of 
children labeled "cars," "buses," 
"trucks," and "motorcycles." Further, 
the node "cars" could be a generalization 
of cars of the type "Toyota," "Ford," 
"Benz," etc. A generalized representation 
of patterns, called symbolic objects, 
was proposed in Diday [1988]. Symbolic 
objects are defined by a logical conjunc- 
tion of events. These events link values 
and features in which the features can 
take one or more values and all the 
objects need not be defined on the same 
set of features. 

It is often valuable to isolate only the 
most descriptive and discriminatory fea- 
tures in the input set, and utilize those 
features exclusively in subsequent anal- 
ysis. Feature selection techniques iden- 



tify a subset of the existing features for 
subsequent use, while feature extrac- 
tion techniques compute new features 
from the original set. In either case, the 
goal is to improve classification perfor- 
mance and/or computational efficiency. 
Feature selection is a well-explored 
topic in statistical pattern recognition 
[Duda and Hart 1973]; however, in a 
clustering context (i.e., lacking class la- 
bels for patterns), the feature selection 
process is of necessity ad hoc, and might 
involve a trial-and-error process where 
various subsets of features are selected, 
the resulting patterns clustered, and 
the output evaluated using a validity 
index. In contrast, some of the popular 
feature extraction processes (e.g., prin- 
cipal components analysis [Fukunaga 
1990]) do not depend on labeled data 
and can be used directly. Reduction of 
the number of features has an addi- 
tional benefit, namely the ability to pro- 
duce output that can be visually in- 
spected by a human. 

4. SIMILARITY MEASURES 

Since similarity is fundamental to the 
definition of a cluster, a measure of the 
similarity between two patterns drawn 
from the same feature space is essential 
to most clustering procedures. Because 
of the variety of feature types and 
scales, the distance measure (or mea- 
sures) must be chosen carefully. It is 
most common to calculate the dissimi- 
larity between two patterns using a dis- 
tance measure defined on the feature 
space. We will focus on the well-known 
distance measures used for patterns 
whose features are all continuous. 

The most popular metric for continu- 
ous features is the Euclidean distance 

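$$d_2(\mathbf{x}_i, \mathbf{x}_j) = \left( \sum_{k=1}^{d} (x_{i,k} - x_{j,k})^2 \right)^{1/2} = \|\mathbf{x}_i - \mathbf{x}_j\|_2,$$

which is a special case (p = 2) of the 
Minkowski metric 

$$d_p(\mathbf{x}_i, \mathbf{x}_j) = \left( \sum_{k=1}^{d} |x_{i,k} - x_{j,k}|^p \right)^{1/p} = \|\mathbf{x}_i - \mathbf{x}_j\|_p.$$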

The Euclidean distance has an intuitive 
appeal as it is commonly used to evalu- 
ate the proximity of objects in two or 
three-dimensional space. It works well 
when a data set has "compact" or "iso- 
lated" clusters [Mao and Jain 1996]. 
The drawback to direct use of the 
Minkowski metrics is the tendency of 
the largest-scaled feature to dominate 
the others. Solutions to this problem 
include normalization of the continuous 
features (to a common range or vari- 
ance) or other weighting schemes. Lin- 
ear correlation among features can also 
distort distance measures; this distor- 
tion can be alleviated by applying a 
whitening transformation to the data or 
by using the squared Mahalanobis dis- 
tance 

$$d_M(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i - \mathbf{x}_j)\, \Sigma^{-1} (\mathbf{x}_i - \mathbf{x}_j)^T,$$

where the patterns $\mathbf{x}_i$ and $\mathbf{x}_j$ are assumed 
to be row vectors, and $\Sigma$ is the 
sample covariance matrix of the patterns 
or the known covariance matrix of 
the pattern generation process; $d_M(\cdot, \cdot)$ 
assigns different weights to different 
features based on their variances and 
pairwise linear correlations. Here, it is 
implicitly assumed that class condi- 
tional densities are unimodal and char- 
acterized by multidimensional spread, 
i.e., that the densities are multivariate 
Gaussian. The regularized Mahalanobis 
distance was used in Mao and Jain 
[1996] to extract hyperellipsoidal clus- 
ters. Recently, several researchers 
[Huttenlocher et al. 1993; Dubuisson 
and Jain 1994] have used the Hausdorff 
distance in a point set matching con- 
text. 
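
The distance measures above can be sketched in a few lines. This is an illustrative sketch, not a reference implementation: the function names (`minkowski`, `standardize`, `mahalanobis_sq_2d`) and the sample patterns are assumptions, normalization is done to zero mean and unit variance (one of the options mentioned above), and the Mahalanobis helper is restricted to two features so that the 2 x 2 matrix inverse can be written out by hand.

```python
import math

def minkowski(a, b, p=2):
    """Minkowski distance of order p between two equal-length patterns."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def standardize(patterns):
    """Rescale every feature to zero mean and unit variance."""
    n = len(patterns)
    columns = list(zip(*patterns))
    means = [sum(col) / n for col in columns]
    # Fall back to 1.0 for constant features to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n) or 1.0
            for col, m in zip(columns, means)]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds))
            for row in patterns]

def mahalanobis_sq_2d(a, b, cov):
    """Squared Mahalanobis distance between 2-D row vectors a and b."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    dx, dy = a[0] - b[0], a[1] - b[1]
    # (dx, dy) * inverse(cov) * (dx, dy)^T with the 2x2 inverse written out.
    return (s22 * dx * dx - (s12 + s21) * dx * dy + s11 * dy * dy) / det

# Hypothetical patterns: (weight in grams, height in meters).
raw = [(1000.0, 1.0), (1010.0, 2.0), (2000.0, 1.0)]
scaled = standardize(raw)
print(minkowski(raw[0], raw[1]), minkowski(raw[0], raw[2]))
print(minkowski(scaled[0], scaled[1]), minkowski(scaled[0], scaled[2]))
```

On the raw patterns the large-scaled weight feature dominates: the first and third patterns appear about a hundred times farther apart than the first and second. After standardization the two distances are nearly equal, because a one-unit change in height now counts as much as a large change in weight.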

Some clustering algorithms work on a 
matrix of proximity values instead of on 
the original pattern set. It is useful in 
such situations to precompute all the 



n(n - 1)/2 pairwise distance values 
for the n patterns and store them in a 
(symmetric) matrix. 
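
A minimal sketch of this precomputation follows. It assumes Euclidean distance and stores only the n(n - 1)/2 entries with i < j in a dictionary; `pairwise_distances` and `lookup` are hypothetical names, and the sample patterns are invented.

```python
import math
from itertools import combinations

def pairwise_distances(patterns):
    """Return {(i, j): distance} for all index pairs with i < j."""
    return {(i, j): math.dist(patterns[i], patterns[j])
            for i, j in combinations(range(len(patterns)), 2)}

def lookup(dist, i, j):
    """Read the symmetric matrix: d(i, j) = d(j, i) and d(i, i) = 0."""
    if i == j:
        return 0.0
    return dist[(i, j) if i < j else (j, i)]

patterns = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (0.0, 1.0)]
dist = pairwise_distances(patterns)
print(len(dist))            # number of stored entries
print(lookup(dist, 2, 0))   # same value as lookup(dist, 0, 2)
```

Storing only the upper triangle halves the memory of a full n x n table, at the cost of the small index swap in `lookup`.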

Computation of distances between 
patterns with some or all features being 
noncontinuous is problematic, since the 
different types of features are not com- 
parable and (as an extreme example) 
the notion of proximity is effectively bi- 
nary-valued for nominal-scaled fea- 
tures. Nonetheless, practitioners (espe- 
cially those in machine learning, where 
mixed-type patterns are common) have 
developed proximity measures for heter- 
ogeneous type patterns. A recent exam- 
ple is Wilson and Martinez [1997], 
which proposes a combination of a mod- 
ified Minkowski metric for continuous 
features and a distance based on counts 
(population) for nominal attributes. A 
variety of other metrics have been re- 
ported in Diday and Simon [1976] and 
Ichino and Yaguchi [1994] for comput- 
ing the similarity between patterns rep- 
resented using quantitative as well as 
qualitative features. 

Patterns can also be represented us- 
ing string or tree structures [Knuth 
1973]. Strings are used in syntactic 
clustering [Fu and Lu 1977]. Several 
measures of similarity between strings 
are described in Baeza-Yates [1992]. A 
good summary of similarity measures 
between trees is given by Zhang [1995]. 
A comparison of syntactic and statisti- 
cal approaches for pattern recognition 
using several criteria was presented in 
Tanaka [1995] and the conclusion was 
that syntactic methods are inferior in 
every aspect. Therefore, we do not con- 
sider syntactic methods further in this 
paper. 

There are some distance measures re- 
ported in the literature [Gowda and 
Krishna 1977; Jarvis and Patrick 1973] 
that take into account the effect of sur- 
rounding or neighboring points. These 
surrounding points are called context in 
Michalski and Stepp [1983]. The similarity 
between two points $\mathbf{x}_i$ and $\mathbf{x}_j$, 
given this context, is given by 

$$s(\mathbf{x}_i, \mathbf{x}_j) = f(\mathbf{x}_i, \mathbf{x}_j, \mathcal{E}),$$

where $\mathcal{E}$ is the context (the set of 
surrounding points). One metric defined 
using context is the mutual neighbor 
distance (MND), proposed in Gowda and 
Krishna [1977], which is given by 

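$$\mathrm{MND}(\mathbf{x}_i, \mathbf{x}_j) = \mathrm{NN}(\mathbf{x}_i, \mathbf{x}_j) + \mathrm{NN}(\mathbf{x}_j, \mathbf{x}_i),$$

where $\mathrm{NN}(\mathbf{x}_i, \mathbf{x}_j)$ is the neighbor number 
of $\mathbf{x}_j$ with respect to $\mathbf{x}_i$. Figures 4 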
and 5 give an example. In Figure 4, the 
nearest neighbor of A is B, and B's 
nearest neighbor is A. So, NN(A, B) = 
NN(B, A) = 1 and the MND between 
A and B is 2. However, NN(B, C) = 1 
but NN(C, B) = 2, and therefore 
MND(B, C) = 3. Figure 5 was obtained 
from Figure 4 by adding three new 
points D, E, and F. Now MND(B, C) 
= 3 (as before), but MND(A, B) = 5. 
The MND between A and B has in- 
creased by introducing additional 
points, even though A and B have not 
moved. The MND is not a metric (it does 
not satisfy the triangle inequality 
[Zhang 1995]). In spite of this, MND has 
been successfully applied in several 
clustering applications [Gowda and Di- 
day 1992]. This observation supports 
the viewpoint that the dissimilarity 
does not need to be a metric. 
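
The context dependence of the MND can be reproduced with a short sketch. The coordinates below are invented so that the MND values match those quoted for Figures 4 and 5; they are not taken from the figures. `neighbor_number` and `mnd` are hypothetical names, and Euclidean distance is assumed for ranking neighbors.

```python
import math

def neighbor_number(points, i, j):
    """Rank of point j among the neighbors of point i (1 = nearest)."""
    order = sorted((k for k in range(len(points)) if k != i),
                   key=lambda k: math.dist(points[i], points[k]))
    return order.index(j) + 1

def mnd(points, i, j):
    """Mutual neighbor distance: sum of the two neighbor numbers."""
    return neighbor_number(points, i, j) + neighbor_number(points, j, i)

# Hypothetical coordinates for A, B, C (indices 0, 1, 2).
before = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0)]
# Add three points D, E, F close to A; A, B, and C do not move.
after = before + [(-0.8, 0.0), (-0.5, 0.5), (-0.5, -0.5)]

print(mnd(before, 0, 1), mnd(before, 1, 2))
print(mnd(after, 0, 1), mnd(after, 1, 2))
```

With these coordinates the MND between A and B rises from 2 to 5 once D, E, and F are added, while the MND between B and C stays at 3, even though the Euclidean distances among A, B, and C are unchanged.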



Watanabe's theorem of the ugly duck- 
ling [Watanabe 1985] states: 

"Insofar as we use a finite set of 
predicates that are capable of dis- 
tinguishing any two objects con- 
sidered, the number of predicates 
shared by any two such objects is 
constant, independent of the 
choice of objects." 

This implies that it is possible to 
make any two arbitrary patterns 
equally similar by encoding them with a 
sufficiently large number of features. As 
a consequence, any two arbitrary pat- 
terns are equally similar, unless we use 
some additional domain information. 
For example, in the case of conceptual 
clustering [Michalski and Stepp 1983], 
the similarity between $\mathbf{x}_i$ and $\mathbf{x}_j$ is 
defined as 

$$s(\mathbf{x}_i, \mathbf{x}_j) = f(\mathbf{x}_i, \mathbf{x}_j, \mathcal{C}, \mathcal{E}),$$

where $\mathcal{C}$ is a set of pre-defined concepts. 
This notion is illustrated with the help 
of Figure 6. Here, the Euclidean dis- 
tance between points A and B is less 
than that between B and C. However, B 
and C can be viewed as "more similar" 
than A and B because B and C belong to 
the same concept (ellipse) and A belongs 
to a different concept (rectangle). The 
conceptual similarity measure is the 
most general similarity measure. We 
discuss several pragmatic issues associated 
with its use in Section 5. 

Figure 6. Conceptual similarity between points. 

5. CLUSTERING TECHNIQUES 

Different approaches to clustering data 
can be described with the help of the 
hierarchy shown in Figure 7 (other tax- 
onometric representations of clustering 
methodology are possible; ours is based 
on the discussion in Jain and Dubes 
[1988]). At the top level, there is a dis- 
tinction between hierarchical and parti- 
tional approaches (hierarchical methods 
produce a nested series of partitions, 
while partitional methods produce only 
one). 

The taxonomy shown in Figure 7 
must be supplemented by a discussion 
of cross-cutting issues that may (in 
principle) affect all of the different approaches 
regardless of their placement 
in the taxonomy. 

—Agglomerative vs. divisive: This as- 
pect relates to algorithmic structure 
and operation. An agglomerative ap- 
proach begins with each pattern in a 
distinct (singleton) cluster, and suc- 
cessively merges clusters together un- 
til a stopping criterion is satisfied. A 
divisive method begins with all patterns 
in a single cluster and performs 
splitting until a stopping criterion is 
met. 



— Monothetic vs. polythetic: This aspect 
relates to the sequential or simulta- 
neous use of features in the clustering 
process. Most algorithms are polythe- 
tic; that is, all features enter into the 
computation of distances between 
patterns, and decisions are based on 
those distances. A simple monothetic 
algorithm reported in Anderberg 
[1973] considers features sequentially 
to divide the given collection of pat- 
terns. This is illustrated in Figure 8. 
Here, the collection is divided into 
two groups using feature $x_1$; the vertical 
broken line V is the separating 
line. Each of these clusters is further 
divided independently using feature 
$x_2$, as depicted by the broken lines $H_1$ 
and $H_2$. The major problem with this 
algorithm is that it generates $2^d$ clusters, 
where d is the dimensionality of 
the patterns. For large values of d 
(d > 100 is typical in information retrieval 
applications [Salton 1991]), 
the number of clusters generated by 
this algorithm is so large that the 
data set is divided into uninterest- 
ingly small and fragmented clusters. 

—Hard vs. fuzzy: A hard clustering al- 
gorithm allocates each pattern to a 
single cluster during its operation and 
in its output. A fuzzy clustering 
method assigns degrees of member- 
ship in several clusters to each input 
pattern. A fuzzy clustering can be 
converted to a hard clustering by as- 
signing each pattern to the cluster 
with the largest measure of member- 
ship. 

—Deterministic vs. stochastic: This is- 
sue is most relevant to partitional 
approaches designed to optimize a 
squared error function. This optimiza- 
tion can be accomplished using tradi- 
tional techniques or through a ran- 
dom search of the state space 
consisting of all possible labelings. 

—Incremental vs. non-incremental: 
This issue arises when the pattern set 
to be clustered is large, and constraints 
on execution time or memory 
space affect the architecture of the 
algorithm. The early history of clustering 
methodology does not contain 
many examples of clustering algorithms 
designed to work with large 
data sets, but the advent of data mining 
has fostered the development of 
clustering algorithms that minimize 
the number of scans through the pattern 
set, reduce the number of patterns 
examined during execution, or 
reduce the size of data structures 
used in the algorithm's operations. 

A cogent observation in Jain and 
Dubes [1988] is that the specification of 
an algorithm for clustering usually 
leaves considerable flexibility in implementation. 

Figure 7. A taxonomy of clustering approaches. 

5.1 Hierarchical Clustering Algorithms 

The operation of a hierarchical cluster- 
ing algorithm is illustrated using the 
two-dimensional data set in Figure 9. 
This figure depicts seven patterns la- 
beled A, B, C, D, E, F, and G in three 
clusters. A hierarchical algorithm yields 
a dendrogram representing the nested 
grouping of patterns and similarity levels 
at which groupings change. A dendrogram 
corresponding to the seven 
points in Figure 9 (obtained from the 
single-link algorithm [Jain and Dubes 
1988]) is shown in Figure 10. The dendrogram 
can be broken at different levels 
to yield different clusterings of the 
data. 

Figure 8. Monothetic partitional clustering. 

Most hierarchical clustering algorithms 
are variants of the single-link 
[Sneath and Sokal 1973], complete-link 
[King 1967], and minimum-variance 
[Ward 1963; Murtagh 1984] algorithms. 
Of these, the single-link and complete-link 
algorithms are most popular. These 
two algorithms differ in the way they 
characterize the similarity between a 
pair of clusters. In the single-link 
method, the distance between two clusters 
is the minimum of the distances 
between all pairs of patterns drawn 
from the two clusters (one pattern from 
the first cluster, the other from the second). 
In the complete-link algorithm, 
the distance between two clusters is the 
maximum of all pairwise distances between 
patterns in the two clusters. In 
either case, two clusters are merged to 
form a larger cluster based on minimum 
distance criteria. The complete-link algorithm 
produces tightly bound or compact 
clusters [Baeza-Yates 1992]. The 
single-link algorithm, by contrast, suffers 
from a chaining effect [Nagy 1968]. 
It has a tendency to produce clusters 
that are straggly or elongated. There 
are two clusters in Figures 12 and 13 
separated by a "bridge" of noisy patterns. 
The single-link algorithm produces 
the clusters shown in Figure 12, 
whereas the complete-link algorithm obtains 
the clustering shown in Figure 13. 
The clusters obtained by the complete-link 
algorithm are more compact than 
those obtained by the single-link algorithm; 
the cluster labeled 1 obtained 
using the single-link algorithm is elongated 
because of the noisy patterns labeled 
"*". The single-link algorithm is 
otherwise more versatile than the complete-link 
algorithm. For example, the 
single-link algorithm can extract the 
concentric clusters shown in Figure 11, 
but the complete-link algorithm cannot. 
However, from a pragmatic viewpoint, it 
has been observed that the complete-link 
algorithm produces more useful hierarchies 
in many applications than the 
single-link algorithm [Jain and Dubes 
1988]. 

Figure 9. Points falling in three clusters. 

Figure 11. Two concentric clusters. 

Agglomerative Single-Link Clustering 
Algorithm 

(1) Place each pattern in its own cluster. 
Construct a list of interpattern 
distances for all distinct unordered 
pairs of patterns, and sort this list 
in ascending order. 

(2) Step through the sorted list of distances, 
forming for each distinct dissimilarity 
value $d_k$ a graph on the 
patterns where pairs of patterns 
closer than $d_k$ are connected by a 
graph edge. If all the patterns are 
members of a connected graph, stop. 
Otherwise, repeat this step. 

(3) The output of the algorithm is a 
nested hierarchy of graphs which 
can be cut at a desired dissimilarity 
level forming a partition (clustering) 
identified by simply connected components 
in the corresponding graph. 

Figure 12. A single-link clustering of a pattern 
set containing two classes (1 and 2) connected by 
a chain of noisy patterns (*). 

Agglomerative Complete-Link Clustering 
Algorithm 

(1) Place each pattern in its own clus- 
ter. Construct a list of interpattern 
distances for all distinct unordered 
pairs of patterns, and sort this list 
in ascending order. 

(2) Step through the sorted list of distances, 
forming for each distinct dissimilarity 
value $d_k$ a graph on the 
patterns where pairs of patterns 
closer than $d_k$ are connected by a 
graph edge. If all the patterns are 
members of a completely connected 
graph, stop. 

(3) The output of the algorithm is a 
nested hierarchy of graphs which 
can be cut at a desired dissimilarity 
level forming a partition (clustering) 
identified by completely connected 
components in the corresponding 
graph. 
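
The graph view of the single-link algorithm can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it handles only the single-link case, cuts the hierarchy at one dissimilarity level `cut` instead of building every level, treats pairs at distance less than or equal to `cut` as connected, and tracks connected components with a union-find structure. The function name and the sample points are hypothetical.

```python
import math
from itertools import combinations

def single_link(points, cut):
    """Clusters are the connected components of the graph whose edges
    join every pair of points at distance <= cut."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Step 1: sorted list of interpattern distances.
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    # Step 2: add edges in ascending order up to the cut level.
    for d, i, j in edges:
        if d > cut:
            break
        parent[find(i)] = find(j)

    # Step 3: read off the connected components.
    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
print(single_link(points, 0.5))
print(single_link(points, 1.5))
print(single_link(points, 8.0))
```

At the 1.5 level, points 0 and 2 fall in the same cluster although they are 2.0 apart, because point 1 links them. This is the chaining behavior described above.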

Hierarchical algorithms are more versatile 
than partitional algorithms. For 
example, the single-link clustering algorithm 
works well on data sets containing 
non-isotropic clusters, including 
well-separated, chain-like, and concentric 
clusters, whereas a typical partitional 
algorithm such as the k-means 
algorithm works well only on data sets 
having isotropic clusters [Nagy 1968]. 
On the other hand, the time and space 
complexities [Day 1992] of the partitional 
algorithms are typically lower 
than those of the hierarchical algorithms. 
It is possible to develop hybrid 
algorithms [Murty and Krishna 1980] 
that exploit the good features of both 
categories. 

Figure 13. A complete-link clustering of a pattern 
set containing two classes (1 and 2) connected 
by a chain of noisy patterns (*). 

Hierarchical Agglomerative Clus- 
tering Algorithm 

(1) Compute the proximity matrix con- 
taining the distance between each 
pair of patterns. Treat each pattern 
as a cluster. 

(2) Find the most similar pair of clus- 
ters using the proximity matrix. 
Merge these two clusters into one 
cluster. Update the proximity ma- 
trix to reflect this merge operation. 

(3) If all patterns are in one cluster, 
stop. Otherwise, go to step 2. 

Based on the way the proximity matrix 
is updated in step 2, a variety of ag- 
glomerative algorithms can be designed. 
Hierarchical divisive algorithms start 
with a single cluster of all the given 
objects and keep splitting the clusters 
based on some criterion to obtain a par- 
tition of singleton clusters. 
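
The three steps of the agglomerative algorithm can be sketched as below. This is a simplified illustration: instead of updating a stored proximity matrix after each merge, it recomputes cluster-to-cluster distances from the patterns, which is slower but makes the role of the update rule explicit. The `linkage` parameter is an assumed interface: passing `min` gives single-link behavior and `max` gives complete-link behavior. Merging stops when `k` clusters remain rather than at a single cluster.

```python
import math

def agglomerative(points, k, linkage=min):
    """Merge the closest pair of clusters until k clusters remain."""
    clusters = [[i] for i in range(len(points))]  # each pattern is a cluster

    def cluster_distance(a, b):
        return linkage(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        # Find the most similar (closest) pair of clusters.
        _, p, q = min((cluster_distance(clusters[p], clusters[q]), p, q)
                      for p in range(len(clusters))
                      for q in range(p + 1, len(clusters)))
        # Merge the pair; q > p, so deleting q leaves index p valid.
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return sorted(sorted(c) for c in clusters)

# Hypothetical points on a line with slowly growing gaps.
points = [(x, 0.0) for x in (0.0, 1.0, 2.0, 3.1, 4.3, 5.6)]
print(agglomerative(points, 2, linkage=min))
print(agglomerative(points, 2, linkage=max))
```

On this data the single-link rule chains five points into one elongated cluster and leaves the last point alone, while the complete-link rule yields the more compact split into the first four and the last two points.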

5.2 Partitional Algorithms 

A partitional clustering algorithm ob- 
tains a single partition of the data in- 
stead of a clustering structure, such as 
the dendrogram produced by a hierar- 
chical technique. Partitional methods 
have advantages in applications involv- 
ing large data sets for which the con- 
struction of a dendrogram is computa- 
tionally prohibitive. A problem 
accompanying the use of a partitional 
algorithm is the choice of the number of 
desired output clusters. A seminal pa- 
per [Dubes 1987] provides guidance on 
this key design decision. The partitional 
techniques usually produce clusters by 
optimizing a criterion function defined 
either locally (on a subset of the pat- 
terns) or globally (defined over all of the 
patterns). Combinatorial search of the 
set of possible labelings for an optimum 
value of a criterion is clearly computa- 
tionally prohibitive. In practice, there- 
fore, the algorithm is typically run mul- 
tiple times with different starting 
states, and the best configuration ob- 
tained from all of the runs is used as the 
output clustering. 

5.2.1 Squared Error Algorithms. 
The most intuitive and frequently used 
criterion function in partitional cluster- 
ing techniques is the squared error cri- 
terion, which tends to work well with 
isolated and compact clusters. The 
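squared error for a clustering $\mathcal{L}$ of a 
pattern set $\mathcal{X}$ (containing K clusters) is 

$$e^2(\mathcal{X}, \mathcal{L}) = \sum_{j=1}^{K} \sum_{i=1}^{n_j} \left\| \mathbf{x}_i^{(j)} - \mathbf{c}_j \right\|^2,$$

where $\mathbf{x}_i^{(j)}$ is the i-th pattern belonging to 
the j-th cluster (which contains $n_j$ patterns) and $\mathbf{c}_j$ is the centroid of 
the j-th cluster. 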

The k-means algorithm is the simplest and most 
commonly used method employing a 
squared error criterion [McQueen 1967]. 
It starts with a random initial partition 
and keeps reassigning the patterns to 
clusters based on the similarity between 
the pattern and the cluster centers until 



a convergence criterion is met (e.g., 
there is no reassignment of any pattern 
from one cluster to another, or the 
squared error ceases to decrease signifi- 
cantly after some number of iterations). 
The k-means algorithm is popular because 
it is easy to implement, and its 
time complexity is O(n), where n is the 
number of patterns. A major problem 
with this algorithm is that it is sensitive 
to the selection of the initial partition 
and may converge to a local minimum of 
the criterion function value if the initial 
partition is not properly chosen. Figure 
14 shows seven two-dimensional pat- 
terns. If we start with patterns A, B, 
and C as the initial means around 
which the three clusters are built, then 
we end up with the partition {{A}, {B, 
C}, {D, E, F, G}} shown by ellipses. The 
squared error criterion value is much 
larger for this partition than for the 
best partition {{A, B, C}, {D, E}, {F, G}} 
shown by rectangles, which yields the 
global minimum value of the squared 
error criterion function for a clustering 
containing three clusters. The correct 
three-cluster solution is obtained by 
choosing, for example, A, D, and F as 
the initial cluster means. 

Squared Error Clustering Method 

(1) Select an initial partition of the pat- 
terns with a fixed number of clus- 
ters and cluster centers. 

(2) Assign each pattern to its closest 
cluster center and compute the new 
cluster centers as the centroids of 
the clusters. Repeat this step until 
convergence is achieved, i.e., until 
the cluster membership is stable. 

(3) Merge and split clusters based on 
some heuristic information, option- 
ally repeating step 2. 

Figure 14. The k-means algorithm is sensitive 
to the initial partition. 

k-Means Clustering Algorithm 

(1) Choose k cluster centers to coincide 
with k randomly-chosen patterns or 
k randomly defined points inside 
the hypervolume containing the pattern 
set. 

(2) Assign each pattern to the closest 
cluster center. 

(3) Recompute the cluster centers using 
the current cluster memberships. 

(4) If a convergence criterion is not met, 
go to step 2. Typical convergence 
criteria are: no (or minimal) reas- 
signment of patterns to new cluster 
centers, or minimal decrease in 
squared error. 
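
The steps above can be sketched as follows. The sketch is illustrative only: the seven patterns are hypothetical coordinates placed on a line so that the arithmetic is easy to check (they are not the coordinates of Figure 14), the initial centers are passed in explicitly instead of being drawn at random, and convergence is declared when no pattern changes cluster. `kmeans` and `squared_error` are assumed names.

```python
import math

def kmeans(patterns, centers, max_iter=100):
    """Return (labels, centers) after running k-means from the given centers."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each pattern to the closest cluster center.
        new_labels = [min(range(len(centers)),
                          key=lambda j: math.dist(p, centers[j]))
                      for p in patterns]
        # Step 4: stop when no pattern is reassigned.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each center as the centroid of its members.
        for j in range(len(centers)):
            members = [p for p, l in zip(patterns, labels) if l == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels, centers

def squared_error(patterns, labels, centers):
    """Sum of squared distances from each pattern to its cluster center."""
    return sum(math.dist(p, centers[l]) ** 2 for p, l in zip(patterns, labels))

# Hypothetical patterns A..G (indices 0..6).
patterns = [(0.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0),
            (11.0, 0.0), (15.0, 0.0), (16.0, 0.0)]

labels_abc, centers_abc = kmeans(patterns, [patterns[0], patterns[1], patterns[2]])
labels_adf, centers_adf = kmeans(patterns, [patterns[0], patterns[3], patterns[5]])
print(labels_abc, squared_error(patterns, labels_abc, centers_abc))
print(labels_adf, squared_error(patterns, labels_adf, centers_adf))
```

Starting from A, B, and C, the run settles on the partition {{A}, {B, C}, {D, E, F, G}} with a squared error of 26.5. Starting from A, D, and F, it reaches {{A, B, C}, {D, E}, {F, G}} with a much smaller squared error, which mirrors the sensitivity to initialization discussed above.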

Several variants [Anderberg 1973] of 
the k-means algorithm have been reported 
in the literature. Some of them 
attempt to select a good initial partition 
so that the algorithm is more likely to 
find the global minimum value. 

Another variation is to permit split- 
ting and merging of the resulting clus- 
ters. Typically, a cluster is split when 
its variance is above a pre-specified 
threshold, and two clusters are merged 
when the distance between their cen- 
troids is below another pre-specified 
threshold. Using this variant, it is pos- 
sible to obtain the optimal partition 
starting from any arbitrary initial parti- 
tion, provided proper threshold values 
are specified. The well-known ISO- 
DATA [Ball and Hall 1965] algorithm 
employs this technique of merging and 
splitting clusters. If ISODATA is given 
the "ellipse" partitioning shown in Fig- 
ure 14 as an initial partitioning, it will 
produce the optimal three-cluster partitioning. 
ISODATA will first merge the 
clusters {A} and {B,C} into one cluster 
because the distance between their centroids 
is small and then split the cluster 
{D,E,F,G}, which has a large variance, 
into two clusters {D,E} and {F,G}. 

Another variation of the k-means algorithm 
involves selecting a different 
criterion function altogether. The dynamic 
clustering algorithm (which permits 
representations other than the 
centroid for each cluster) was proposed 
in Diday [1973], and Symon [1977] 
describes a dynamic clustering approach 
obtained by formulating the 
clustering problem in the framework of 
maximum-likelihood estimation. The 
regularized Mahalanobis distance was 
used in Mao and Jain [1996] to obtain 
hyperellipsoidal clusters. 

5.2.2 Graph-Theoretic Clustering. 
The best-known graph-theoretic divisive 
clustering algorithm is based on con- 
struction of the minimal spanning tree 
(MST) of the data [Zahn 1971], and then 
deleting the MST edges with the largest 
lengths to generate clusters. Figure 15 
depicts the MST obtained from nine 
two-dimensional points. By breaking 
the link labeled CD with a length of 6 
units (the edge with the maximum Eu- 
clidean length), two clusters ({A, B, C} 
and {D, E, F, G, H, I}) are obtained. The 
second cluster can be further divided 
into two clusters by breaking the edge 
EF, which has a length of 4.5 units. 
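
A minimal sketch of MST-based divisive clustering follows. It assumes Euclidean edge lengths, builds the tree with a simple (quadratic-scan) version of Prim's algorithm, and deletes the k - 1 longest tree edges to obtain k clusters. The nine points are hypothetical: they are placed on a line so that the gap C-D is 6 units and the gap E-F is 4.5 units, matching the edge lengths quoted above but not the layout of Figure 15.

```python
import math

def mst_edges(points):
    """Prim's algorithm; returns the tree as a list of (length, i, j)."""
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Shortest edge joining the tree to a point outside it.
        length, i, j = min((math.dist(points[i], points[j]), i, j)
                           for i in in_tree
                           for j in range(n) if j not in in_tree)
        edges.append((length, i, j))
        in_tree.add(j)
    return edges

def mst_clusters(points, k):
    """Delete the k - 1 longest MST edges; return the connected components."""
    kept = sorted(mst_edges(points))[:len(points) - k]
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for _, i, j in kept:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical points A..I (indices 0..8).
xs = (0.0, 1.0, 2.0, 8.0, 9.0, 13.5, 14.5, 15.5, 16.5)
points = [(x, 0.0) for x in xs]
print(mst_clusters(points, 2))
print(mst_clusters(points, 3))
```

Asking for two clusters removes the 6-unit edge and separates {A, B, C} from the rest; asking for three also removes the 4.5-unit edge between E and F.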

The hierarchical approaches are also 
related to graph-theoretic clustering. 
Single-link clusters are subgraphs of 
the minimum spanning tree of the data 
[Gower and Ross 1969] which are also 
the connected components [Gotlieb and 
Kumar 1968]. Complete-link clusters 
are maximal complete subgraphs, and 
are related to the node colorability of 
graphs [Backer and Hubert 1976]. The 
maximal complete subgraph was considered 
the strictest definition of a cluster 
in Augustson and Minker [1970] and 
Raghavan and Yu [1981]. A graph-oriented 
approach for non-hierarchical 
structures and overlapping clusters is 
presented in Ozawa [1985]. The Delaunay 
graph (DG) is obtained by connecting 
all the pairs of points that are 
Voronoi neighbors. The DG contains all 
the neighborhood information contained 
in the MST and the relative neighborhood 
graph (RNG) [Toussaint 1980]. 

Figure 15. Using the minimal spanning tree to 
form clusters. 

5.3 Mixture-Resolving and Mode-Seeking 
Algorithms 

The mixture resolving approach to clus- 
ter analysis has been addressed in a 
number of ways. The underlying as- 
sumption is that the patterns to be clus- 
tered are drawn from one of several 
distributions, and the goal is to identify 
the parameters of each and (perhaps) 
their number. Most of the work in this 
area has assumed that the individual 
components of the mixture density are 
Gaussian, and in this case the parame- 
ters of the individual Gaussians are to 
be estimated by the procedure. Tradi- 
tional approaches to this problem in- 
volve obtaining (iteratively) a maximum 
likelihood estimate of the parameter 
vectors of the component densities [Jain 
and Dubes 1988]. 

More recently, the Expectation Maxi- 
mization (EM) algorithm (a general- 
purpose maximum likelihood algorithm 
[Dempster et al. 1977] for missing-data 
problems) has been applied to the prob- 
lem of parameter estimation. A recent 
book [Mitchell 1997] provides an acces- 



sible description of the technique. In the 
EM framework, the parameters of the 
component densities are unknown, as 
are the mixing parameters, and these 
are estimated from the patterns. The 
EM procedure begins with an initial 
estimate of the parameter vector and 
iteratively rescores the patterns against 
the mixture density produced by the 
parameter vector. The rescored patterns 
are then used to update the parameter 
estimates. In a clustering context, the 
scores of the patterns (which essentially 
measure their likelihood of being drawn 
from particular components of the mix- 
ture) can be viewed as hints at the class 
of the pattern. Those patterns, placed 
(by their scores) in a particular compo- 
nent, would therefore be viewed as be- 
longing to the same cluster. 
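
The rescoring loop described above can be sketched for a one-dimensional mixture of Gaussians. This is a toy illustration under several assumptions: the number of components is fixed in advance, the data and the initial parameter estimates are invented, a small variance floor is added to keep the arithmetic stable, and no safeguard is included for a component that receives no probability mass. `gaussian` and `em_gaussian_mixture` are hypothetical names.

```python
import math

def gaussian(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gaussian_mixture(data, means, variances, weights, iterations=50):
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(iterations):
        # E-step: score every pattern against every mixture component.
        scores = []
        for x in data:
            s = [weights[j] * gaussian(x, means[j], variances[j]) for j in range(k)]
            total = sum(s)
            scores.append([v / total for v in s])
        # M-step: update the parameter estimates from the rescored patterns.
        for j in range(k):
            mass = sum(row[j] for row in scores)
            weights[j] = mass / len(data)
            means[j] = sum(row[j] * x for row, x in zip(scores, data)) / mass
            variances[j] = max(
                sum(row[j] * (x - means[j]) ** 2 for row, x in zip(scores, data)) / mass,
                1e-6)  # variance floor: an assumption, not part of the source
    # Clustering view: each pattern goes to its highest-scoring component.
    labels = [max(range(k), key=lambda j: row[j]) for row in scores]
    return means, variances, weights, labels

data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights, labels = em_gaussian_mixture(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
print(means, labels)
```

On this well-separated toy data the estimated means move from the initial guesses of 0 and 6 to approximately 1 and 5, and the scores place the first five patterns in one component and the last five in the other.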

Nonparametric techniques for densi- 
ty-based clustering have also been de- 
veloped [Jain and Dubes 1988]. Inspired 
by the Parzen window approach to non- 
parametric density estimation, the cor- 
responding clustering procedure 
searches for bins with large counts in a 
multidimensional histogram of the in- 
put pattern set. Other approaches in- 
clude the application of another parti- 
tional or hierarchical clustering 
algorithm using a distance measure 
based on a nonparametric density esti- 
mate. 



5.4 Nearest Neighbor Clustering 

Since proximity plays a key role in our 
intuitive notion of a cluster, nearest- 
neighbor distances can serve as the ba- 
sis of clustering procedures. An itera- 
tive procedure was proposed in Lu and 
Fu [1978]; it assigns each unlabeled 
pattern to the cluster of its nearest la- 
beled neighbor pattern, provided the 
distance to that labeled neighbor is be- 
low a threshold. The process continues 
until all patterns are labeled or no addi- 
tional labelings occur. The mutual 
neighborhood value (described earlier in 
the context of distance computation) can 
also be used to grow clusters from near 
neighbors. 
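
The iterative labeling procedure can be sketched as follows. The text does not say how the first labeled patterns are obtained, so this sketch assumes that a few seed patterns are labeled by the caller; `seeds`, the function name, and the sample points are all hypothetical. The result can depend on the order in which unlabeled patterns are visited.

```python
import math

def nearest_neighbor_clustering(points, seeds, threshold):
    """seeds maps a point index to a cluster label; all other points
    start unlabeled. Returns the final index -> label mapping."""
    labels = dict(seeds)
    changed = True
    # Continue until all patterns are labeled or no new labelings occur.
    while changed and len(labels) < len(points):
        changed = False
        for i in range(len(points)):
            if i in labels:
                continue
            # Nearest labeled neighbor of the unlabeled pattern i.
            j = min(labels, key=lambda m: math.dist(points[i], points[m]))
            if math.dist(points[i], points[j]) < threshold:
                labels[i] = labels[j]
                changed = True
    return labels

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
          (10.0, 0.0), (11.0, 0.0), (30.0, 0.0)]
result = nearest_neighbor_clustering(points, {0: "a", 3: "b"}, threshold=1.5)
print(result)
```

Point 2 is 2.0 away from the seed of cluster "a", which exceeds the threshold, but it is still absorbed because point 1 is labeled first and lies within 1.0 of it. The isolated point 5 has no labeled neighbor within the threshold and stays unlabeled.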

5.5 Fuzzy Clustering 

Traditional clustering approaches gen- 
erate partitions; in a partition, each 
pattern belongs to one and only one 
cluster. Hence, the clusters in a hard 
clustering are disjoint. Fuzzy clustering 
extends this notion to associate each 
pattern with every cluster using a mem- 
bership function [Zadeh 1965]. The out- 
put of such algorithms is a clustering, 
but not a partition. We give a high-level 
partitional fuzzy clustering algorithm 
below. 

Fuzzy Clustering Algorithm 

(1) Select an initial fuzzy partition of 
the N objects into K clusters by 
selecting the N × K membership 
matrix U. An element $u_{ij}$ of this 
matrix represents the grade of membership 
of object $\mathbf{x}_i$ in cluster $\mathbf{c}_j$. 
Typically, $u_{ij} \in [0,1]$. 

(2) Using U, find the value of a fuzzy 
criterion function, e.g., a weighted 
squared error criterion function, associated 
with the corresponding partition. 
One possible fuzzy criterion 
function is 

$$E^2(\mathcal{X}, U) = \sum_{i=1}^{N} \sum_{k=1}^{K} u_{ik} \left\| \mathbf{x}_i - \mathbf{c}_k \right\|^2,$$

where $\mathbf{c}_k = \sum_{i=1}^{N} u_{ik} \mathbf{x}_i$ is the k-th fuzzy 
cluster center. 

Reassign patterns to clusters to reduce 
this criterion function value 
and recompute U. 

(3) Repeat step 2 until entries in U do 
not change significantly. 
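
One concrete way to carry out this loop is the standard fuzzy c-means update, sketched below for one-dimensional patterns. Note the substitutions: this sketch uses the usual fuzzy c-means formulas with a fuzzifier m = 2 (memberships raised to the power m and centers normalized by the total weight), which differ in detail from the simpler criterion written above; it starts from initial centers instead of an initial membership matrix; and it stops when the centers stop moving. The function name, data, and parameters are assumptions.

```python
def fuzzy_c_means(data, centers, m=2.0, iterations=100, tol=1e-9):
    """Return (U, centers) where U[i][k] is the membership of pattern i
    in cluster k. Each row of U sums to 1."""
    centers = list(centers)
    k = len(centers)
    exponent = 2.0 / (m - 1.0)
    U = []
    for _ in range(iterations):
        # Recompute memberships from the distances to the current centers.
        U = []
        for x in data:
            d = [max(abs(x - c), 1e-12) for c in centers]  # avoid division by 0
            U.append([1.0 / sum((d[a] / d[b]) ** exponent for b in range(k))
                      for a in range(k)])
        # Recompute each center as a membership-weighted mean.
        new_centers = [sum(u[a] ** m * x for u, x in zip(U, data)) /
                       sum(u[a] ** m for u in U)
                       for a in range(k)]
        shift = max(abs(p - q) for p, q in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return U, centers

data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
U, centers = fuzzy_c_means(data, centers=[0.0, 6.0])
print([round(c, 3) for c in centers])
print([round(u, 3) for u in U[0]])
```

Every pattern receives a membership in both clusters, with values close to 1 for the nearby cluster and close to 0 for the distant one on this well-separated toy data.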

Figure 16. Fuzzy clusters.

In fuzzy clustering, each cluster is a
fuzzy set of all the patterns. Figure 16
illustrates the idea. The rectangles
enclose two "hard" clusters in the data:
$H_1$ = {1,2,3,4,5} and $H_2$ = {6,7,8,9}.
A fuzzy clustering algorithm might
produce the two fuzzy clusters $F_1$ and $F_2$
depicted by ellipses. The patterns will
have membership values in [0,1] for
each cluster. For example, fuzzy cluster
$F_1$ could be compactly described as

{(1,0.9), (2,0.8), (3,0.7), (4,0.6), (5,0.55),
(6,0.2), (7,0.2), (8,0.0), (9,0.0)}

and $F_2$ could be described as

{(1,0.0), (2,0.0), (3,0.0), (4,0.1), (5,0.15),
(6,0.4), (7,0.35), (8,1.0), (9,0.9)}

The ordered pairs $(i, \mu_i)$ in each cluster
represent the ith pattern and its
membership value $\mu_i$ in the cluster. Larger
membership values indicate higher con- 
fidence in the assignment of the pattern 
to the cluster. A hard clustering can be 
obtained from a fuzzy partition by 
thresholding the membership value. 
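
As a small illustration, the two membership lists given in the text can be hardened as follows. The rule used here (assign each pattern to the cluster in which its membership is largest) is one possible thresholding choice and is an assumption of this sketch; the function name is hypothetical.

```python
# Membership values of patterns 1..9 in the two fuzzy clusters from the text.
F1 = {1: 0.9, 2: 0.8, 3: 0.7, 4: 0.6, 5: 0.55, 6: 0.2, 7: 0.2, 8: 0.0, 9: 0.0}
F2 = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.1, 5: 0.15, 6: 0.4, 7: 0.35, 8: 1.0, 9: 0.9}

def harden(fuzzy_clusters):
    """Assign each pattern to the fuzzy cluster in which its membership is largest."""
    hard = [set() for _ in fuzzy_clusters]
    for pattern in fuzzy_clusters[0]:
        memberships = [cluster[pattern] for cluster in fuzzy_clusters]
        hard[memberships.index(max(memberships))].add(pattern)
    return hard

H1, H2 = harden([F1, F2])
print(sorted(H1), sorted(H2))  # [1, 2, 3, 4, 5] [6, 7, 8, 9]
```

With this rule the hardened clusters coincide with the two hard clusters of Figure 16. A fixed cutoff (say 0.5) applied to each cluster separately would instead leave patterns 6 and 7 unassigned.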

Fuzzy set theory was initially applied 
to clustering in Ruspini [1969]. The 
book by Bezdek [1981] is a good source 
for material on fuzzy clustering. The 
most popular fuzzy clustering algorithm
is the fuzzy c-means (FCM) algorithm.
Even though it is better than the hard
k-means algorithm at avoiding local
minima, FCM can still converge to local
minima of the squared error criterion.
The design of membership functions is
the most important problem in fuzzy
clustering; different choices include
those based on similarity decomposition
and centroids of clusters. A generalization
of the FCM algorithm was proposed
by Bezdek [1981] through a family of
objective functions. A fuzzy c-shell
algorithm and an adaptive variant for
detecting circular and elliptical
boundaries were presented in Dave [1992].

5.6 Representation of Clusters 

In applications where the number of 
classes or clusters in a data set must be 
discovered, a partition of the data set is 
the end product. Here, a partition gives 
an idea about the separability of the 
data points into clusters and whether it 
is meaningful to employ a supervised 
classifier that assumes a given number 
of classes in the data set. However, in 
many other applications that involve 
decision making, the resulting clusters 
have to be represented or described in a 
compact form to achieve data abstrac- 
tion. Even though the construction of a 
cluster representation is an important 
step in decision making, it has not been 
examined closely by researchers. The 
notion of cluster representation was in- 
troduced in Duran and Odell [1974] and 
was subsequently studied in Diday and 
Simon [1976] and Michalski et al. 
[1981]. They suggested the following 
representation schemes: 

(1) Represent a cluster of points by 
their centroid or by a set of distant 
points in the cluster. Figure 17 de- 
picts these two ideas. 



(2) Represent clusters using nodes in a 
classification tree. This is illus- 
trated in Figure 18. 

(3) Represent clusters by using conjunctive
logical expressions. For example,
the expression $[X_1 > 3][X_2 < 2]$ in
Figure 18 stands for the logical
statement '$X_1$ is greater than 3' and '$X_2$ is
less than 2'.

Use of the centroid to represent a 
cluster is the most popular scheme. It 
works well when the clusters are com- 
pact or isotropic. However, when the 
clusters are elongated or non-isotropic, 
then this scheme fails to represent them 
properly. In such a case, the use of a 
collection of boundary points in a clus- 
ter captures its shape well. The number 
of points used to represent a cluster 
should increase as the complexity of its 
shape increases. The two different rep- 
resentations illustrated in Figure 18 are 
equivalent. Every path in a classifica- 
tion tree from the root node to a leaf 
node corresponds to a conjunctive state- 
ment. An important limitation of the 
typical use of the simple conjunctive 
concept representations is that they can 
describe only rectangular or isotropic 
clusters in the feature space. 

Figure 18. Representation of clusters by a classification tree or by conjunctive statements.

Data abstraction is useful in decision
making because of the following:

(1) It gives a simple and intuitive
description of clusters which is easy
for human comprehension. In both
conceptual clustering [Michalski
and Stepp 1983] and symbolic
clustering [Gowda and Diday 1992] this
representation is obtained without 
using an additional step. These al- 
gorithms generate the clusters as 
well as their descriptions. A set of 
fuzzy rules can be obtained from 
fuzzy clusters of a data set. These 
rules can be used to build fuzzy clas- 
sifiers and fuzzy controllers. 
(2) It helps in achieving data compression
that can be exploited further by
a computer [Murty and Krishna
1980]. Figure 19(a) shows samples
belonging to two chain-like clusters
labeled 1 and 2. A partitional
clustering algorithm like k-means
cannot separate these two structures
properly. The single-link algorithm
works well on this data, but is
computationally expensive. So a hybrid
approach may be used to exploit
the desirable properties of both
these algorithms. We obtain 8
subclusters of the data using the
(computationally efficient) k-means
algorithm. Each of these subclusters can
be represented by its centroid as
shown in Figure 19(a). Now the
single-link algorithm can be applied on
these centroids alone to cluster
them into 2 groups. The resulting
groups are shown in Figure 19(b).
Here, a data reduction is achieved
by representing the subclusters by
their centroids.

(3) It increases the efficiency of the de- 
cision making task. In a cluster- 
based document retrieval technique 
[Salton 1991], a large collection of 
documents is clustered and each of 
the clusters is represented using its 
centroid. In order to retrieve docu- 
ments relevant to a query, the query 
is matched with the cluster cen- 
troids rather than with all the docu- 
ments. This helps in retrieving rele- 
vant documents efficiently. Also in 
several applications involving large 
data sets, clustering is used to per- 
form indexing, which helps in effi- 
cient decision making [Dorai and 
Jain 1995]. 
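
The hybrid scheme of item (2), k-means followed by single-link clustering of the resulting centroids, can be sketched as below. The data are synthetic (two parallel chains, not the data of Figure 19), and the function names, the fixed iteration count, and the choice of initial centers are assumptions made for the example.

```python
import math

def k_means(points, centers, iters=20):
    """Plain k-means from the given initial centers; returns the final centroids."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            groups[nearest].append(p)
        # Recompute each centroid; keep the old center if a group is empty.
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else old
            for g, old in zip(groups, centers)]
    return centers

def single_link(items, n_groups):
    """Agglomerative single-link clustering of `items` into `n_groups` groups."""
    groups = [[x] for x in items]
    while len(groups) > n_groups:
        # Find the two groups whose closest members are nearest to each other.
        a, b = min(
            ((i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))),
            key=lambda ij: min(math.dist(p, q)
                               for p in groups[ij[0]] for q in groups[ij[1]]))
        merged = groups.pop(b)  # b > a, so index a is unaffected
        groups[a].extend(merged)
    return groups

# Two chain-like clusters: 16 points on the line y = 0 and 16 on y = 3.
chain_1 = [(0.5 * i, 0.0) for i in range(16)]
chain_2 = [(0.5 * i, 3.0) for i in range(16)]
points = chain_1 + chain_2

centroids = k_means(points, centers=points[::4])   # 8 subclusters
groups = single_link(centroids, n_groups=2)        # single-link on 8 centroids only
for g in groups:
    print(g)
```

The single-link step works on 8 centroids instead of 32 patterns, which is where the data reduction comes from; each of the two final groups contains the centroids of one chain.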



Figure 19. Data compression by clustering.

5.7 Artificial Neural Networks for
Clustering

Artificial neural networks (ANNs)
[Hertz et al. 1991] are motivated by
biological neural networks. ANNs have
been used extensively over the past
three decades for both classification and
clustering [Sethi and Jain 1991; Jain
and Mao 1994]. Some of the features of
the ANNs that are important in pattern
clustering are:



(1) ANNs process numerical vectors and 
so require patterns to be represented 
using quantitative features only. 

(2) ANNs are inherently parallel and 
distributed processing architec- 
tures. 

(3) ANNs may learn their interconnec- 
tion weights adaptively [Jain and 
Mao 1996; Oja 1982]. More specifically,
they can act as pattern normalizers
and feature selectors by
appropriate selection of weights.

Competitive (or winner-take-all) 
neural networks [Jain and Mao 1996] 
are often used to cluster input data. In 
competitive learning, similar patterns 
are grouped by the network and repre- 
sented by a single unit (neuron). This 
grouping is done automatically based on 
data correlations. Well-known examples 
of ANNs used for clustering include Ko- 
honen's learning vector quantization 
(LVQ) and self-organizing map (SOM) 
[Kohonen 1984], and adaptive reso- 
nance theory models [Carpenter and 
Grossberg 1990]. The architectures of 
these ANNs are simple: they are single- 
layered. Patterns are presented at the 
input and are associated with the out- 
put nodes. The weights between the in- 
put nodes and the output nodes are 
iteratively changed (this is called
learning) until a termination criterion is
satisfied. Competitive learning has been
found to exist in biological neural
networks. However, the learning or weight
update procedures are quite similar to
those in some classical clustering
approaches. For example, the relationship
between the k-means algorithm and
LVQ is addressed in Pal et al. [1993].
The learning algorithm in ART models 
is similar to the leader clustering algo- 
rithm [Moor 1988]. 
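
A minimal sketch of winner-take-all learning in a single-layer net is given below. The text does not state a specific update rule, so the rule used here (move only the winning unit's weight vector toward the input, with a decaying learning rate) is an assumption, as are the function names and parameter values.

```python
import math

def winner_of(x, weights):
    """Index of the output unit whose weight vector is closest to pattern x."""
    return min(range(len(weights)), key=lambda j: math.dist(x, weights[j]))

def competitive_learning(patterns, weights, rate=0.5, decay=0.9, epochs=20):
    """Winner-take-all learning: only the winning unit is updated for each pattern."""
    weights = [list(w) for w in weights]
    for _ in range(epochs):
        for x in patterns:
            j = winner_of(x, weights)
            # Move the winner's weight vector toward the input pattern.
            weights[j] = [w + rate * (xi - w) for w, xi in zip(weights[j], x)]
        rate *= decay  # shrink the learning rate so that the weights settle
    return weights

patterns = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
weights = competitive_learning(patterns, weights=[(0.0, 0.0), (10.0, 10.0)])
print([winner_of(x, weights) for x in patterns])  # [0, 0, 0, 1, 1, 1]
```

Each output unit ends up representing one group of similar patterns, which mirrors the centroid update of sequential k-means.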

The SOM gives an intuitively appeal- 
ing two-dimensional map of the multidi- 
mensional data set, and it has been 
successfully used for vector quantiza- 
tion and speech recognition [Kohonen 
1984]. However, like its sequential 
counterpart, the SOM generates a sub- 
optimal partition if the initial weights 
are not chosen properly. Further, its 
convergence is controlled by various pa- 
rameters such as the learning rate and 
a neighborhood of the winning node in 
which learning takes place. It is possi- 
ble that a particular input pattern can 
fire different output units at different 
iterations; this brings up the stability 
issue of learning systems. The system is 
said to be stable if no pattern in the 
training data changes its category after 
a finite number of learning iterations. 
This problem is closely associated with 
the problem of plasticity, which is the 
ability of the algorithm to adapt to new 
data. For stability, the learning rate 
should be decreased to zero as iterations 
progress and this affects the plasticity. 
The ART models are supposed to be 
stable and plastic [Carpenter and 
Grossberg 1990]. However, ART nets 
are order-dependent; that is, different 
partitions are obtained for different or- 
ders in which the data is presented to 
the net. Also, the size and number of 
clusters generated by an ART net de- 
pend on the value chosen for the vigi- 
lance threshold, which is used to decide 
whether a pattern is to be assigned to 
one of the existing clusters or start a 
new cluster. Further, both SOM and 
ART are suitable for detecting only hy- 
perspherical clusters [Hertz et al. 1991]. 
A two-layer network that employs regu- 
larized Mahalanobis distance to extract 
hyperellipsoidal clusters was proposed 
in Mao and Jain [1994]. All these ANNs
use a fixed number of output nodes,
which limits the number of clusters that
can be produced. 
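
The order dependence and the role of a vigilance-like threshold can be illustrated with the leader clustering algorithm, which the text describes as similar to ART learning. The sketch below is a stand-in for illustration and is not an ART network; the function name and the use of Euclidean distance are assumptions.

```python
import math

def leader_clustering(patterns, threshold):
    """Assign each pattern to the first leader within `threshold`, else start a new cluster."""
    leaders, clusters = [], []
    for x in patterns:
        for k, leader in enumerate(leaders):
            if math.dist(x, leader) <= threshold:
                clusters[k].append(x)
                break
        else:
            # No existing leader is close enough: x becomes a new leader.
            leaders.append(x)
            clusters.append([x])
    return clusters

data = [(0.0,), (1.0,), (2.0,), (3.0,)]
print(leader_clustering(data, threshold=1.5))
# [[(0.0,), (1.0,)], [(2.0,), (3.0,)]]
print(leader_clustering(data[1:] + data[:1], threshold=1.5))
# [[(1.0,), (2.0,), (0.0,)], [(3.0,)]]
```

The same four patterns yield two different partitions depending on presentation order, and a larger or smaller threshold changes the number of clusters produced.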

5.8 Evolutionary Approaches for 
Clustering 

Evolutionary approaches, motivated by 
natural evolution, make use of evolu- 
tionary operators and a population of 
solutions to obtain the globally optimal 
partition of the data. Candidate solu- 
tions to the clustering problem are en- 
coded as chromosomes. The most com- 
monly used evolutionary operators are: 
selection, recombination, and mutation. 
Each transforms one or more input 
chromosomes into one or more output 
chromosomes. A fitness function evalu- 
ated on a chromosome determines a 
chromosome's likelihood of surviving 
into the next generation. We give below 
a high-level description of an evolution- 
ary algorithm applied to clustering. 

An Evolutionary Algorithm for 
Clustering 

(1) Choose a random population of
solutions. Each solution here
corresponds to a valid k-partition of the
data. Associate a fitness value with 
each solution. Typically, fitness is 
inversely proportional to the 
squared error value. A solution with 
a small squared error will have a 
larger fitness value. 

(2) Use the evolutionary operators se- 
lection, recombination and mutation 
to generate the next population of 
solutions. Evaluate the fitness val- 
ues of these solutions. 

(3) Repeat step 2 until some termina- 
tion condition is satisfied. 

The best-known evolutionary tech- 
niques are genetic algorithms (GAs) 
[Holland 1975; Goldberg 1989], evolu- 
tion strategies (ESs) [Schwefel 1981], 
and evolutionary programming (EP)
[Fogel et al. 1965]. Out of these three
approaches, GAs have been most
frequently used in clustering. Typically,
solutions are binary strings in GAs. In
GAs, a selection operator propagates
solutions from the current generation to
the next generation based on their fit- 
ness. Selection employs a probabilistic 
scheme so that solutions with higher 
fitness have a higher probability of get- 
ting reproduced. 

There are a variety of recombination 
operators in use; crossover is the most 
popular. Crossover takes as input a pair 
of chromosomes (called parents) and 
outputs a new pair of chromosomes 
(called children or offspring) as depicted 
in Figure 20. In Figure 20, a single 
point crossover operation is depicted. It 
exchanges the segments of the parents 
across a crossover point. For example, 
in Figure 20, the parents are the binary 
strings '10110101' and '11001110'. The 
segments in the two parents after the 
crossover point (between the fourth and 
fifth locations) are exchanged to pro- 
duce the child chromosomes. Mutation 
takes as input a chromosome and out- 
puts a chromosome by complementing 
the bit value at a randomly selected 
location in the input chromosome. For 
example, the string '11111110' is gener- 
ated by applying the mutation operator 
to the second bit location in the string 
'10111110' (starting at the left). Both 
crossover and mutation are applied with 
some prespecified probabilities which 
depend on the fitness values. 
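
The two operators can be written in a few lines. The sketch below applies them to the strings quoted above; the function names and the 1-based bit location argument are illustrative choices, and the probabilistic application of the operators is omitted.

```python
def single_point_crossover(parent_a, parent_b, point):
    """Swap the segments of two equal-length strings after position `point`."""
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate(chromosome, location):
    """Complement the bit at 1-based `location`, counting from the left."""
    i = location - 1
    flipped = "1" if chromosome[i] == "0" else "0"
    return chromosome[:i] + flipped + chromosome[i + 1:]

# Crossover point between the fourth and fifth locations.
children = single_point_crossover("10110101", "11001110", point=4)
print(children)               # ('10111110', '11000101')
print(mutate("10111110", 2))  # 11111110
```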

GAs represent points in the search 
space as binary strings, and rely on the
crossover operator to explore the search
space. Mutation is used in GAs for the
sake of completeness, that is, to make
sure that no part of the search space is
left unexplored. ESs and EP differ from
GAs in solution representation and
type of the mutation operator used; EP
does not use a recombination operator,
but only selection and mutation. Each of
these three approaches has been used
to solve the clustering problem by view- 
ing it as a minimization of the squared 
error criterion. Some of the theoretical 
issues such as the convergence of these 
approaches were studied in Fogel and 
Fogel [1994]. 

GAs perform a globalized search for 
solutions whereas most other clustering 
procedures perform a localized search. 
In a localized search, the solution ob- 
tained at the 'next iteration' of the pro- 
cedure is in the vicinity of the current 
solution. In this sense, the k-means
algorithm, fuzzy clustering algorithms,
ANNs used for clustering, various an- 
nealing schemes (see below), and tabu 
search are all localized search tech- 
niques. In the case of GAs, the crossover 
and mutation operators can produce 
new solutions that are completely dif- 
ferent from the current ones. We illus- 
trate this fact in Figure 21. Let us as- 
sume that the scalar X is coded using a 
5-bit binary representation, and let $S_1$
and $S_2$ be two points in the one-dimensional
search space. The decimal values
of $S_1$ and $S_2$ are 8 and 31, respectively.
Their binary representations are $S_1$ =
01000 and $S_2$ = 11111. Let us apply
the single-point crossover to these
strings, with the crossover site falling
between the second and third most
significant bits as shown below.

01|000

11|111

This will produce a new pair of points or
chromosomes $S_3$ and $S_4$ as shown in
Figure 21. Here, $S_3$ = 01111 and
$S_4$ = 11000. The corresponding decimal
values are 15 and 24, respectively.
Similarly, by mutating the most
significant bit in the binary string 01111
(decimal 15), the binary string 11111
(decimal 31) is generated. These jumps, or
gaps between points in successive
generations, are much larger than those
produced by other approaches.

Figure 21. GAs perform globalized search.

Perhaps the earliest paper on the use 
of GAs for clustering is by Raghavan 
and Birchand [1979], where a GA was 
used to minimize the squared error of a 
clustering. Here, each point or chromosome
represents a partition of N objects
into K clusters and is represented by a
K-ary string of length N. For example,
consider six patterns (A, B, C, D, E,
and F) and the string 101001. This
six-bit binary (K = 2) string corresponds to
placing the six patterns into two
clusters. This string represents a
two-partition, where one cluster has the first,
third, and sixth patterns and the second
cluster has the remaining patterns. In
other words, the two clusters are
{A,C,F} and {B,D,E} (the six-bit binary
string 010110 represents the same
clustering of the six patterns). When there
are K clusters, there are K! different
chromosomes corresponding to each
K-partition of the data. This increases
the effective search space size by a
factor of K!. Further, if crossover is applied
on two good chromosomes, the resulting
offspring may be inferior in this
representation. For example, let {A,B,C} and
{D,E,F} be the clusters in the optimal 
2-partition of the six patterns consid- 
ered above. The corresponding chromo- 
somes are 111000 and 000111. By ap- 
plying single-point crossover at the 
location between the third and fourth 
bit positions on these two strings, we 
get 111111 and 000000 as offspring and 
both correspond to an inferior partition. 
These problems have motivated re- 
searchers to design better representa- 
tion schemes and crossover operators. 
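
Both problems of the label-string representation can be checked directly. In the sketch below, `partition_of` decodes a chromosome and `canonical` relabels clusters in order of first appearance; the latter is a hypothetical helper added for illustration and is not one of the improved schemes from the literature.

```python
def partition_of(chromosome, patterns="ABCDEF"):
    """Decode a K-ary label string into a partition (a set of frozensets)."""
    clusters = {}
    for pattern, label in zip(patterns, chromosome):
        clusters.setdefault(label, set()).add(pattern)
    return {frozenset(c) for c in clusters.values()}

def canonical(chromosome):
    """Relabel clusters in order of first appearance, so equivalent strings coincide."""
    relabel = {}
    for label in chromosome:
        relabel.setdefault(label, str(len(relabel)))
    return "".join(relabel[label] for label in chromosome)

# Redundancy: two different strings encode the same 2-partition.
print(partition_of("101001") == partition_of("010110"))  # True
print(canonical("101001"), canonical("010110"))          # 010110 010110

# Crossover of two encodings of {A,B,C}, {D,E,F} after the third bit.
a, b = "111000", "000111"
child_1, child_2 = a[:3] + b[3:], b[:3] + a[3:]
print(child_1, child_2)            # 111111 000000
print(len(partition_of(child_1)))  # 1
```

Both offspring place all six patterns in a single cluster, even though each parent encodes the optimal 2-partition.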

In Bhuyan et al. [1991], an improved 
representation scheme is proposed 
where an additional separator symbol is 
used along with the pattern labels to 
represent a partition. Let the separator 
symbol be represented by *. Then the 
chromosome ACF*BDE corresponds to a 
2-partition {A,C,F} and {B,D,E}. Using 
this representation permits them to 
map the clustering problem into a per- 
mutation problem such as the traveling 
salesman problem, which can be solved 
by using the permutation crossover op- 
erators [Goldberg 1989]. This solution 
also suffers from permutation redun- 
dancy. There are 72 equivalent chromo- 
somes (permutations) corresponding to 
the same partition of the data into the 
two clusters {A,C,F} and {B,D,E}. 
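
The count of 72 follows from 3! orderings within each of the two clusters and 2! orderings of the clusters themselves (3! × 3! × 2! = 72). A brute-force check over all arrangements of the seven symbols is sketched below; the helper name `decode` is an illustrative choice.

```python
from itertools import permutations

def decode(chromosome):
    """Split a separator-based chromosome such as 'ACF*BDE' into a partition."""
    return frozenset(frozenset(part) for part in chromosome.split("*"))

target = decode("ACF*BDE")
count = sum(1 for perm in permutations("ACFBDE*")
            if decode("".join(perm)) == target)
print(count)  # 72
```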

More recently, Jones and Beltramo 
[1991] investigated the use of edge- 
based crossover [Whitley et al. 1989] to 
solve the clustering problem. Here, all 
patterns in a cluster are assumed to 
form a complete graph by connecting 
them with edges. Offspring are gener- 
ated from the parents so that they in- 
herit the edges from their parents. It is 
observed that this crossover operator
takes $O(K^6 + N)$ time for N patterns
and K clusters, ruling out its
applicability on practical data sets having more
than 10 clusters. In a hybrid approach
proposed in Babu and Murty [1993], the
GA is used only to find good initial
cluster centers and the k-means
algorithm is applied to find the final
partition. This hybrid approach performed
better than the GA.

A major problem with GAs is their 
sensitivity to the selection of various 
parameters such as population size, 
crossover and mutation probabilities, 
etc. Grefenstette [1986]
has studied this problem and suggested 
guidelines for selecting these control pa- 
rameters. However, these guidelines 
may not yield good results on specific 
problems like pattern clustering. It was 
reported in Jones and Beltramo [1991] 
that hybrid genetic algorithms incorpo- 
rating problem-specific heuristics are 
good for clustering. A similar claim is 
made in Davis [1991] about the applica- 
bility of GAs to other practical prob- 
lems. Another issue with GAs is the 
selection of an appropriate representa- 
tion which is low in order and short in 
defining length. 

It is possible to view the clustering 
problem as an optimization problem 
that locates the optimal centroids of the 
clusters directly rather than finding an 
optimal partition using a GA. This view 
permits the use of ESs and EP, because 
centroids can be coded easily in both 
these approaches, as they support the 
direct representation of a solution as a 
real-valued vector. In Babu and Murty 
[1994], ESs were used on both hard and 
fuzzy clustering problems and EP has 
been used to evolve fuzzy min-max clus- 
ters [Fogel and Simpson 1993]. It has 
been observed that they perform better 
than their classical counterparts, the
k-means algorithm and the fuzzy
c-means algorithm. However, all of
these approaches suffer (as do GAs and 
ANNs) from sensitivity to control pa- 
rameter selection. For each specific 
problem, one has to tune the parameter 
values to suit the application. 



5.9 Search-Based Approaches 

Search techniques used to obtain the 
optimum value of the criterion function 
are divided into deterministic and
stochastic search techniques. Deterministic
search techniques guarantee an optimal
partition by performing exhaustive
enumeration. On the other hand, the
stochastic search techniques generate a
near-optimal partition reasonably
quickly, and guarantee convergence to
the optimal partition asymptotically.
Among the techniques considered so far,
evolutionary approaches are stochastic
and the remainder are deterministic.
Other deterministic approaches to
clustering include the branch-and-bound
technique adopted in Koontz et al.
[1975] and Cheng [1995] for generating
optimal partitions. This approach gen- 
erates the optimal partition of the data 
at the cost of excessive computational 
requirements. In Rose et al. [1993], a 
deterministic annealing approach was 
proposed for clustering. This approach 
employs an annealing technique in 
which the error surface is smoothed, but 
convergence to the global optimum is 
not guaranteed. The use of determinis- 
tic annealing in proximity-mode cluster- 
ing (where the patterns are specified in 
terms of pairwise proximities rather 
than multidimensional points) was ex- 
plored in Hofmann and Buhmann 
[1997]; later work applied the determin- 
istic annealing approach to texture seg- 
mentation [Hofmann and Buhmann 
1998]. 

The deterministic approaches are typ- 
ically greedy descent approaches, 
whereas the stochastic approaches per- 
mit perturbations to the solutions in 
non-locally optimal directions also with 
nonzero probabilities. The stochastic 
search techniques are either sequential 
or parallel, while evolutionary ap- 
proaches are inherently parallel. The 
simulated annealing approach (SA) 
[Kirkpatrick et al. 1983] is a sequential 
stochastic search technique, whose ap- 
plicability to clustering is discussed in 
Klein and Dubes [1989]. Simulated an- 
nealing procedures are designed to 
avoid (or recover from) solutions which
correspond to local optima of the objective
functions. This is accomplished by
accepting, with some probability, a new
solution of lower quality (as measured by
the criterion function) for the next
iteration. The probability of acceptance
is governed by a critical parameter 
called the temperature (by analogy with 
annealing in metals), which is typically 
specified in terms of a starting (first 
iteration) and final temperature value. 
Selim and Al-Sultan [1991] studied the
effects of control parameters on the
performance of the algorithm, and
Baeza-Yates [1992] used SA to obtain a
near-optimal partition of the data. SA is
statistically guaranteed to find the
globally optimal solution [Aarts and Korst
1989]. A high-level outline of an SA-based
algorithm for clustering is given
below.

Clustering Based on Simulated 
Annealing 

(1) Randomly select an initial partition,
$P_0$, and compute the squared
error value, $E_{P_0}$. Select values for
the control parameters, initial and
final temperatures $T_0$ and $T_f$.

(2) Select a neighbor $P_1$ of $P_0$ and
compute its squared error value, $E_{P_1}$. If
$E_{P_1}$ is larger than $E_{P_0}$, then assign
$P_1$ to $P_0$ with a temperature-dependent
probability. Else assign $P_1$ to
$P_0$. Repeat this step for a fixed
number of iterations.

(3) Reduce the value of $T_0$, i.e. $T_0 = cT_0$,
where c is a predetermined
constant. If $T_0$ is greater than $T_f$,
then go to step 2. Else stop. 
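
The three steps can be sketched as follows. Several details are assumptions of this example and are not specified in the outline: a neighbor is generated by relabeling one randomly chosen pattern, the temperature-dependent acceptance probability is taken to be exp(-ΔE / T), and the best partition seen so far is recorded separately. Function and parameter names are illustrative.

```python
import math
import random

def squared_error(points, labels, K):
    """Sum of squared distances from each point to the centroid of its cluster."""
    total = 0.0
    for k in range(K):
        members = [p for p, l in zip(points, labels) if l == k]
        if members:
            centroid = [sum(c) / len(members) for c in zip(*members)]
            total += sum(math.dist(p, centroid) ** 2 for p in members)
    return total

def sa_clustering(points, K, T0=10.0, Tf=0.01, c=0.9, moves_per_T=50, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial partition and its squared error.
    current = [rng.randrange(K) for _ in points]
    e_current = squared_error(points, current, K)
    best, e_best = list(current), e_current
    T = T0
    while T > Tf:
        for _ in range(moves_per_T):
            # Step 2: a neighbor differs from the current partition in one label.
            neighbor = list(current)
            neighbor[rng.randrange(len(points))] = rng.randrange(K)
            e_neighbor = squared_error(points, neighbor, K)
            delta = e_neighbor - e_current
            # A worse neighbor is accepted with probability exp(-delta / T).
            if delta <= 0 or rng.random() < math.exp(-delta / T):
                current, e_current = neighbor, e_neighbor
                if e_current < e_best:
                    best, e_best = list(current), e_current
        T *= c  # Step 3: geometric cooling by the constant c.
    return best, e_best

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels, error = sa_clustering(points, K=2)
print(labels, round(error, 6))
```

With c = 0.9 the temperature passes through 66 levels between 10 and 0.01; a value of c closer to 1 cools more slowly at the cost of more iterations.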

The SA algorithm can be slow in 
reaching the optimal solution, because 
optimal results require the temperature 
to be decreased very slowly from itera- 
tion to iteration. 

Tabu search [Glover 1986], like SA, is 
a method designed to cross boundaries 
of feasibility or local optimality and to 
systematically impose and release con- 
straints to permit exploration of other- 
wise forbidden regions. Tabu search 
was used to solve the clustering prob- 
lem in Al-Sultan [1995]. 

5.10 A Comparison of Techniques

In this section we have examined vari- 
ous deterministic and stochastic search 
techniques to approach the clustering 
problem as an optimization problem. A 
majority of these methods use the 
squared error criterion function. Hence, 
the partitions generated by these ap- 
proaches are not as versatile as those 
generated by hierarchical algorithms. 
The clusters generated are typically
hyperspherical in shape. Evolutionary
approaches are globalized search
techniques, whereas the rest of the
approaches are localized search
techniques. ANNs and GAs are inherently
parallel, so they can be implemented 
using parallel hardware to improve 
their speed. Evolutionary approaches 
are population-based; that is, they 
search using more than one solution at 
a time, and the rest are based on using 
a single solution at a time. ANNs, GAs, 
SA, and Tabu search (TS) are all sensi- 
tive to the selection of various learning/ 
control parameters. In theory, all four of 
these methods are weak methods [Rich 
1983] in that they do not use explicit 
domain knowledge. An important fea- 
ture of the evolutionary approaches is 
that they can find the optimal solution 
even when the criterion function is dis- 
continuous. 

An empirical study of the perfor- 
mance of the following heuristics for 
clustering was presented in Mishra and 
Raghavan [1994]; SA, GA, TS, random- 
ized branch-and-bound (RBA) [Mishra 
and Raghavan 1994], and hybrid search 
(HS) strategies [Ismail and Kamel 1989] 
were evaluated. The conclusion was 
that GA performs well in the case of 
one-dimensional data, while its perfor- 
mance on high dimensional data sets is 
not impressive. The performance of SA 
is not attractive because it is very slow. 
RBA and TS performed best. HS is good 
for high dimensional data. However, 
none of the methods was found to be
superior to others by a significant
margin. An empirical study of k-means, SA,
TS, and GA was presented in Al-Sultan
and Khan [1996]. TS, GA and SA were
judged comparable in terms of solution
quality, and all were better than
k-means. However, the k-means method
is the most efficient in terms of
execution time; other schemes took more time
(by a factor of 500 to 2500) to partition a 
data set of size 60 into 5 clusters. Fur- 
ther, GA encountered the best solution 
faster than TS and SA; SA took more 
time than TS to encounter the best solu- 
tion. However, GA took the maximum 
time for convergence, that is, to obtain a 
population of only the best solutions, 
followed by TS and SA. An important 
observation is that in both Mishra and 
Raghavan [1994] and Al-Sultan and 
Khan [1996] the sizes of the data sets 
considered are small; that is, fewer than 
200 patterns. 

A two-layer network was employed in 
Mao and Jain [1996], with the first 
layer including a number of principal 
component analysis subnets, and the 
second layer using a competitive net. 
This network performs partitional clus- 
tering using the regularized Mahalano- 
bis distance. This net was trained using 
a set of 1000 randomly selected pixels 
from a large image and then used to 
classify every pixel in the image. Babu
et al. [1997] proposed a stochastic
connectionist approach (SCA) and
compared its performance on standard data
sets with both the SA and k-means
algorithms. It was observed that SCA is
superior to both SA and k-means in
terms of solution quality. Evolutionary
approaches are good only when the data
size is less than 1000 and for low
dimensional data.

In summary, only the k-means
algorithm and its ANN equivalent, the
Kohonen net [Mao and Jain 1996], have
been applied on large data sets; other
approaches have been tested, typically,
on small data sets. This is because
obtaining suitable learning/control
parameters for ANNs, GAs, TS, and SA is
difficult and their execution times are
very high for large data sets. However,
it has been shown [Selim and Ismail
1984] that the k-means method
converges to a locally optimal solution. This
behavior is linked with the initial seed
selection in the k-means algorithm. So
if a good initial partition can be
obtained quickly using any of the other
techniques, then k-means would work
well even on problems with large data
sets. Even though various methods dis- 
cussed in this section are comparatively 
weak, it was revealed through experi- 
mental studies that combining domain 
knowledge would improve their perfor- 
mance. For example, ANNs work better 
in classifying images represented using 
extracted features than with raw im- 
ages, and hybrid classifiers work better 
than ANNs [Mohiuddin and Mao 1994]. 
Similarly, using domain knowledge to 
hybridize a GA improves its perfor- 
mance [Jones and Beltramo 1991]. So it 
may be useful in general to use domain 
knowledge along with approaches like 
GA, SA, ANN, and TS. However, these 
approaches (specifically, the criteria 
functions used in them) have a tendency 
to generate a partition of hyperspheri- 
cal clusters, and this could be a limita- 
tion. For example, in cluster-based doc- 
ument retrieval, it was observed that 
the hierarchical algorithms performed 
better than the partitional algorithms 
[Rasmussen 1992]. 



5.11 Incorporating Domain Constraints in
Clustering 

As a task, clustering is subjective in 
nature. The same data set may need to 
be partitioned differently for different 
purposes. For example, consider a 
whale, an elephant, and a tuna fish 
[Watanabe 1985]. Whales and elephants 
form a cluster of mammals. However, if 
the user is interested in partitioning 
them based on the concept of living in 
water, then whale and tuna fish are 
clustered together. Typically, this sub- 
jectivity is incorporated into the cluster- 
ing criterion by incorporating domain 
knowledge in one or more phases of 
clustering. 



Every clustering algorithm uses some 
type of knowledge either implicitly or 
explicitly. Implicit knowledge plays a 
role in (1) selecting a pattern represen- 
tation scheme (e.g., using one's prior 
experience to select and encode fea- 
tures), (2) choosing a similarity measure 
(e.g., using the Mahalanobis distance
instead of the Euclidean distance to
obtain hyperellipsoidal clusters), and (3)
selecting a grouping scheme (e.g.,
specifying the k-means algorithm when it is
known that clusters are hyperspherical).
Domain knowledge is used implicitly
in ANNs, GAs, TS, and SA to select
the control/learning parameter values 
that affect the performance of these al- 
gorithms. 
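
To illustrate how the choice of similarity measure encodes knowledge about cluster shape, the sketch below compares the Euclidean and Mahalanobis distances for a cluster that is elongated along its first feature. The covariance matrix, the two test points, and the function name are hypothetical values chosen for the example.

```python
import math

def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between 2-D points for a 2x2 covariance matrix `cov`."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Inverse of the 2x2 covariance matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = x[0] - y[0], x[1] - y[1]
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(q)

cov = ((9.0, 0.0), (0.0, 1.0))   # variance 9 along feature 1, variance 1 along feature 2
center = (0.0, 0.0)
p, q = (3.0, 0.0), (0.0, 2.0)

print(math.dist(p, center), math.dist(q, center))   # 3.0 2.0
print(round(mahalanobis_2d(p, center, cov), 6),
      round(mahalanobis_2d(q, center, cov), 6))     # 1.0 2.0
```

Under the Euclidean distance the second point is closer to the center, whereas under the Mahalanobis distance the first point is closer, because it lies along the direction in which the cluster is spread out.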

It is also possible to use explicitly
available domain knowledge to con- 
strain or guide the clustering process. 
Such specialized clustering algorithms 
have been used in several applications. 
Domain concepts can play several roles 
in the clustering process, and a variety 
of choices are available to the practitio- 
ner. At one extreme, the available do- 
main concepts might easily serve as an 
additional feature (or several), and the 
remainder of the procedure might be 
otherwise unaffected. At the other ex- 
treme, domain concepts might be used 
to confirm or veto a decision arrived at 
independently by a traditional cluster- 
ing algorithm, or used to affect the com- 
putation of distance in a clustering algo- 
rithm employing proximity. The 
incorporation of domain knowledge into 
clustering consists mainly of ad hoc ap- 
proaches with little in common; accord- 
ingly, our discussion of the idea will 
consist mainly of motivational material 
and a brief survey of past work. Ma- 
chine learning research and pattern rec- 
ognition research intersect in this topi- 
cal area, and the interested reader is 
referred to the prominent journals in 
machine learning (e.g., Machine Learn-
ing, J. of AI Research, or Artificial Intel-
ligence) for a fuller treatment of this
topic. 

As documented in Cheng and Fu
[1985], rules in an expert system may 
be clustered to reduce the size of the 
knowledge base. This modification of 
clustering was also explored in the do- 
mains of universities, congressional vot- 
ing records, and terrorist events by Leb- 
owitz [1987]. 

5.11.1 Similarity Computation. Con- 
ceptual knowledge was used explicitly 
in the similarity computation phase in 
Michalski and Stepp [1983]. It was as- 
sumed that the pattern representations 
were available and the dynamic cluster- 
ing algorithm [Diday 1973] was used to 
group patterns. The clusters formed 
were described using conjunctive state- 
ments in predicate logic. It was stated 
in Stepp and Michalski [1986] and 
Michalski and Stepp [1983] that the 
groupings obtained by the conceptual 
clustering are superior to those ob- 
tained by the numerical methods for 
clustering. A critical analysis of that 
work appears in Dale [1985], and it was 
observed that monothetic divisive clus- 
tering algorithms generate clusters that 
can be described by conjunctive state- 
ments. For example, consider Figure 8. 
Four clusters in this figure, obtained 
using a monothetic algorithm, can be 
described by using conjunctive concepts 
as shown below: 

Cluster 1: $[X \le a] \wedge [Y \le b]$

Cluster 2: $[X \le a] \wedge [Y > b]$

Cluster 3: $[X > a] \wedge [Y > c]$

Cluster 4: $[X > a] \wedge [Y \le c]$

where $\wedge$ is the Boolean conjunction
('and') operator, and a, b, and c are
constants.
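
To make the conjunctive descriptions concrete, the following minimal sketch tests which description a two-feature pattern satisfies. It assumes the four descriptions partition the plane (x ≤ a versus x > a, then y against b or c); the function name `describe` and the threshold values are hypothetical and are not taken from Figure 8.

```python
def describe(x, y, a, b, c):
    """Return the number (1-4) of the conjunctive description that (x, y) satisfies."""
    if x <= a:
        # Clusters 1 and 2 share the condition [X <= a] and split on b.
        return 1 if y <= b else 2
    # Clusters 3 and 4 share the condition [X > a] and split on c.
    return 3 if y > c else 4


# Hypothetical thresholds and patterns, chosen only for illustration.
a, b, c = 5.0, 3.0, 6.0
points = [(1.0, 2.0), (2.0, 7.0), (8.0, 9.0), (9.0, 1.0)]
labels = [describe(x, y, a, b, c) for x, y in points]
print(labels)
```

Each test involves one feature at a time, which is why a monothetic divisive algorithm yields clusters of this form.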

5.11.2 Pattern Representation. It was 
shown in Srivastava and Murty [1990] 
that by using knowledge in the pattern 
representation phase, as is implicitly 
done in numerical taxonomy ap- 
proaches, it is possible to obtain the 
same partitions as those generated by 
conceptual clustering. In this sense,
conceptual clustering and numerical
taxonomy are not diametrically oppo- 
site, but are equivalent. In the case of 
conceptual clustering, domain knowl- 
edge is explicitly used in interpattern 
similarity computation, whereas in nu- 
merical taxonomy it is implicitly as- 
sumed that pattern representations are 
obtained using the domain knowledge. 

5.11.3 Cluster Descriptions. Typi- 
cally, in knowledge-based clustering, 
both the clusters and their descriptions 
or characterizations are generated 
[Fisher and Langley 1985]. There are 
some exceptions, for instance, Gowda
and Diday [1992], where only clustering 
is performed and no descriptions are 
generated explicitly. In conceptual clus- 
tering, a cluster of objects is described 
by a conjunctive logical expression 
[Michalski and Stepp 1983]. Even 
though a conjunctive statement is one of 
the most common descriptive forms 
used by humans, it is a limited form. In 
Shekar et al. [1987], functional knowl- 
edge of objects was used to generate 
more intuitively appealing cluster de- 
scriptions that employ the Boolean im- 
plication operator. A system that repre- 
sents clusters probabilistically was 
described in Fisher [1987]; these de- 
scriptions are more general than con- 
junctive concepts, and are well-suited to 
hierarchical classification domains (e.g., 
the animal species hierarchy). A concep- 
tual clustering system in which cluster- 
ing is done first is described in Fisher 
and Langley [1985]. These clusters are 
then described using probabilities. A 
similar scheme was described in Murty 
and Jain [1995], but the descriptions 
are logical expressions that employ both 
conjunction and disjunction. 

An important characteristic of concep- 
tual clustering is that it is possible to 
group objects represented by both qual- 
itative and quantitative features if the 
clustering leads to a conjunctive con- 
cept. For example, the concept cricket 
ball might be represented as

$(\text{color} = \text{red}) \wedge (\text{shape} = \text{sphere}) \wedge (\text{make} = \text{leather}) \wedge (\text{radius} = 1.4\ \text{inches})$,

where radius is a quantitative feature
and the rest are all qualitative features.

Figure 22. Functional knowledge.

This description is used to describe a 
cluster of cricket balls. In Stepp and 
Michalski [1986], a graph (the goal de-
pendency network) was used to group
structured objects. In Shekar et al. 
[1987] functional knowledge was used 
to group man-made objects. Functional 
knowledge was represented using 
and/or trees [Rich 1983]. For example, 
the function cooking shown in Figure 22 
can be decomposed into functions like 
holding and heating the material in a 
liquid medium. Each man-made object 
has a primary function for which it is 
produced. Further, based on its fea- 
tures, it may serve additional functions. 
For example, a book is meant for read- 
ing, but if it is heavy then it can also be 
used as a paper weight. In Sutton et al. 
[1993], object functions were used to 
construct generic recognition systems. 

5.11.4 Pragmatic Issues. Any imple- 
mentation of a system that explicitly 
incorporates domain concepts into a 
clustering technique has to address the 
following important pragmatic issues: 

(1) Representation, availability and 
completeness of domain concepts. 

(2) Construction of inferences using the 
knowledge. 

(3) Accommodation of changing or dy- 
namic knowledge. 



In some domains, complete knowledge 
is available explicitly. For example, the 
ACM Computing Reviews classification 
tree used in Murty and Jain [1995] is 
complete and is explicitly available for 
use. In several domains, knowledge is 
incomplete and is not available explic- 
itly. Typically, machine learning tech- 
niques are used to automatically extract 
knowledge, which is a difficult and chal- 
lenging problem. The most prominently 
used learning method is "learning from 
examples" [Quinlan 1990]. This is an
inductive learning scheme used to ac- 
quire knowledge from examples of each 
of the classes in different domains. Even 
if the knowledge is available explicitly, 
it is difficult to find out whether it is 
complete and sound. Further, it is ex- 
tremely difficult to verify soundness 
and completeness of knowledge ex- 
tracted from practical data sets, be- 
cause such knowledge cannot be repre- 
sented in propositional logic. It is 
possible that both the data and knowl- 
edge keep changing with time. For ex- 
ample, in a library, new books might get 
added and some old books might be 
deleted from the collection with time. 
Also, the classification system (knowl- 
edge) employed by the library is up- 
dated periodically. 

A major problem with knowledge- 
based clustering is that it has not been 
applied to large data sets or in domains 
with large knowledge bases. Typically, 
the number of objects grouped was less 
than 1000, and the number of rules used as
a part of the knowledge was less than 
100. The most difficult problem is to use 
a very large knowledge base for cluster- 
ing objects in several practical problems 
including data mining, image segmenta- 
tion, and document retrieval. 

5.12 Clustering Large Data Sets 

There are several applications where it 
is necessary to cluster a large collection 
of patterns. The definition of 'large' has 
varied (and will continue to do so) with 
changes in technology (e.g., memory and 
processing time). In the 1960s, 'large' 
meant several thousand patterns [Ross
1968]; now, there are applications 
where millions of patterns of high di- 
mensionality have to be clustered. For 
example, to segment an image of size 
500 × 500 pixels, the number of pixels
to be clustered is 250,000. In document 
retrieval and information filtering, mil- 
lions of patterns with a dimensionality 
of more than 100 have to be clustered to 
achieve data abstraction. A majority of 
the approaches and algorithms pro- 
posed in the literature cannot handle 
such large data sets. Approaches based 
on genetic algorithms, tabu search and 
simulated annealing are optimization 
techniques and are restricted to reason- 
ably small data sets. Implementations 
of conceptual clustering optimize some 
criterion functions and are typically 
computationally expensive. 

The convergent k-means algorithm
and its ANN equivalent, the Kohonen
net, have been used to cluster large
data sets [Mao and Jain 1996]. The rea-
sons behind the popularity of the
k-means algorithm are:

(1) Its time complexity is O(nkl),
where n is the number of patterns,
k is the number of clusters, and l is
the number of iterations taken by
the algorithm to converge. Typi-
cally, k and l are fixed in advance
and so the algorithm has linear time
complexity in the size of the data set
[Day 1992].

(2) Its space complexity is O(k + n). It
requires additional space to store 
the data matrix. It is possible to 
store the data matrix in a secondary 
memory and access each pattern 
based on need. However, this 
scheme requires a huge access time 
because of the iterative nature of 
the algorithm, and as a consequence 
processing time increases enor- 
mously. 

(3) It is order-independent; for a given
initial seed set of cluster centers, it
generates the same partition of the
data irrespective of the order in
which the patterns are presented to
the algorithm.

Table I. Complexity of Clustering Algorithms

Clustering Algorithm      Time Complexity   Space Complexity
leader                    O(kn)             O(k)
k-means                   O(nkl)            O(k)
ISODATA                   O(nkl)            O(k)
shortest spanning path    O(n^2)            O(n)
single-link               O(n^2 log n)      O(n^2)
complete-link             O(n^2 log n)      O(n^2)

However, the k-means algorithm is sen-
sitive to initial seed selection and, even
in the best case, can produce only
hyperspherical clusters.
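
The properties listed above can be seen in a minimal sketch of batch k-means, given here in pure Python. The function names, the toy patterns, and the seed set are our own illustrative assumptions. Each iteration performs n × k distance evaluations, giving O(nkl) over l iterations, and for fixed seeds the resulting partition does not depend on the order in which the patterns are presented.

```python
import math


def kmeans(points, seeds, max_iter=100):
    """Batch k-means: n * k distance evaluations per iteration."""
    centers = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each pattern goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # converged: no pattern changed cluster
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


def partition(points, labels):
    """Express a labeling as a set of clusters so that orderings can be compared."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, set()).add(p)
    return {frozenset(g) for g in groups.values()}


# Hypothetical patterns and seeds, chosen only for illustration.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
seeds = [(0, 0), (10, 10)]
labels_a, _ = kmeans(points, seeds)
labels_b, _ = kmeans(points[::-1], seeds)  # same patterns, reversed order
same = partition(points, labels_a) == partition(points[::-1], labels_b)
print(labels_a, same)
```

Changing `seeds` can change the result, which is the sensitivity to initial seed selection noted above.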

Hierarchical algorithms are more ver- 
satile. But they have the following dis- 
advantages: 

(1) The time complexity of hierarchical
agglomerative algorithms is O(n^2 log n)
[Kurita 1991]. It is possible
to obtain single-link clusters using
an MST of the data, which can be
constructed in O(n log^2 n) time for
two-dimensional data [Choudhury
and Murty 1990].

(2) The space complexity of agglomera-
tive algorithms is O(n^2). This is be-
cause a similarity matrix of size
n × n has to be stored. To cluster
every pixel in a 100 × 100 image,
approximately 200 megabytes of
storage would be required (assuming
single-precision storage of similari-
ties). It is possible to compute the
entries of this matrix based on need
instead of storing them (this would
increase the algorithm's time com-
plexity [Anderberg 1973]).
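
The storage estimate in item (2) can be checked with a short calculation. The sketch below assumes 4 bytes per single-precision similarity and, as our own assumption, that only the n(n − 1)/2 entries above the diagonal of the symmetric matrix are stored; the function name is hypothetical.

```python
def similarity_storage_bytes(n, bytes_per_entry=4, symmetric=True):
    """Bytes needed to hold the pairwise similarities of n patterns."""
    # A symmetric matrix needs only the entries above the diagonal.
    entries = n * (n - 1) // 2 if symmetric else n * n
    return entries * bytes_per_entry


n = 100 * 100  # one pattern per pixel of a 100 x 100 image
print(similarity_storage_bytes(n) / 1e6, "MB (upper triangle)")
print(similarity_storage_bytes(n, symmetric=False) / 1e6, "MB (full matrix)")
```

Under these assumptions the upper triangle occupies just under 200 megabytes, and the full n × n matrix twice that.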

Table I lists the time and space com-
plexities of several well-known algo-
rithms. Here, n is the number of pat-
terns to be clustered, k is the number of
clusters, and l is the number of itera-
tions.

A possible solution to the problem of
clustering large data sets while only 
marginally sacrificing the versatility of 
clusters is to implement more efficient 
variants of clustering algorithms. A hy- 
brid approach was used in Ross [1968], 
where a set of reference points is chosen 
as in the k-means algorithm, and each
of the remaining data points is assigned 
to one or more reference points or clus- 
ters. Minimal spanning trees (MST) are 
obtained for each group of points sepa- 
rately. These MSTs are merged to form 
an approximate global MST. This ap- 
proach computes similarities between 
only a fraction of all possible pairs of 
points. It was shown that the number of 
similarities computed for 10,000 pat- 
terns using this approach is the same as 
the total number of pairs of points in a 
collection of 2,000 points. Bentley and
Friedman [1978] contains an algorithm
that can compute an approximate MST
in O(n log n) time. A scheme to gener-
ate an approximate dendrogram incre-
mentally in O(n log n) time was pre-
sented in Zupan [1982], while
Venkateswarlu and Raju [1992] pro- 
posed an algorithm to speed up the ISO- 
DATA clustering algorithm. A study of 
the approximate single-linkage cluster 
analysis of large data sets was reported 
in Eddy et al. [1994]. In that work, an 
approximate MST was used to form sin- 
gle-link clusters of a data set of size 
40,000. 
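
The link between an MST and single-link clusters can be sketched as follows. This is a minimal illustration under our own assumptions, not any of the approximate algorithms cited above: it builds an exact MST with Prim's algorithm in O(n^2) time and obtains k single-link clusters by discarding the k − 1 longest MST edges. All names and the toy patterns are hypothetical.

```python
import math


def mst_edges(points):
    """Exact minimum spanning tree by Prim's algorithm, O(n^2) time."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known connection of each point to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def single_link_clusters(points, k):
    """Keep the n - k shortest MST edges; the connected components are the clusters."""
    kept = sorted(mst_edges(points))[: len(points) - k]
    root = list(range(len(points)))

    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]  # path compression
            i = root[i]
        return i

    for _, u, v in kept:
        root[find(u)] = find(v)
    return [find(i) for i in range(len(points))]


# Hypothetical patterns, chosen only for illustration.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
labels = single_link_clusters(points, k=3)
print(labels)
```

An approximate MST, as in the work cited above, would replace `mst_edges` while leaving the edge-removal step unchanged.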

The emerging discipline of data min- 
ing (discussed as an application in Sec- 
tion 6) has spurred the development of 
new algorithms for clustering large data 
sets. Two algorithms of note are the 
CLARANS algorithm developed by Ng 
and Han [1994] and the BIRCH algo- 
rithm proposed by Zhang et al. [1996]. 
CLARANS (Clustering Large Applica- 
tions based on RANdom Search) identi- 
fies candidate cluster centroids through 
analysis of repeated random samples 
from the original data. Because of the
use of random sampling, the time com- 
plexity is O(n) for a pattern set of n 
elements. The BIRCH algorithm (Bal-
anced Iterative Reducing and Cluster-
ing) stores summary information about 
candidate clusters in a dynamic tree 
data structure. This tree hierarchically 
organizes the clusterings represented at 
the leaf nodes. The tree can be rebuilt 
when a threshold specifying cluster size 
is updated manually, or when memory 
constraints force a change in this 
threshold. This algorithm, like CLAR- 
ANS, has a time complexity linear in 
the number of patterns. 

The algorithms discussed above work 
on large data sets, where it is possible 
to accommodate the entire pattern set 
in the main memory. However, there 
are applications where the entire data 
set cannot be stored in the main mem- 
ory because of its size. There are cur- 
rently three possible approaches to 
solve this problem. 

(1) The pattern set can be stored in a 
secondary memory and subsets of 
this data clustered independently, 
followed by a merging step to yield a 
clustering of the entire pattern set. 
We call this approach the divide and 
conquer approach. 

(2) An incremental clustering algorithm 
can be employed. Here, the entire 
data matrix is stored in a secondary 
memory and data items are trans- 
ferred to the main memory one at a 
time for clustering. Only the cluster 
representations are stored in the 
main memory to alleviate the space 
limitations. 

(3) A parallel implementation of a clus-
tering algorithm may be used.

We discuss these approaches in the next
three subsections.

5.12.1 Divide and Conquer Approach. 
Here, we store the entire pattern matrix 
of size n × d in a secondary storage
space (e.g., a disk file). We divide this 
data into p blocks, where an optimum 
value of p can be chosen based on the 
clustering algorithm used [Murty and 
Krishna 1980]. Let us assume that we 
have n/p patterns in each of the blocks. 

Figure 23. Divide and conquer approach to
clustering.



We transfer each of these blocks to the 
main memory and cluster it into k clus- 
ters using a standard algorithm. One or 
more representative samples from each 
of these clusters are stored separately; 
we have pk of these representative pat- 
terns if we choose one representative 
per cluster. These pk representatives 
are further clustered into k clusters and 
the cluster labels of these representa- 
tive patterns are used to relabel the 
original pattern matrix. We depict this 
two-level algorithm in Figure 23. It is 
possible to extend this algorithm to any 
number of levels; more levels are re- 
quired if the data set is very large and 
the main memory size is very small 
[Murty and Krishna 1980]. If the single-
link algorithm is used to obtain 5 clus-
ters, then there is a substantial savings
in the number of computations, as
shown in Table II for optimally chosen
p. However, this algorithm works well
only when the points in each block are
reasonably homogeneous, a condition
that image data often satisfies.
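
The two-level procedure can be sketched as follows. This is a minimal illustration under several assumptions of our own: k-means seeded with the first k patterns of each block stands in for the "standard algorithm", the cluster centroid serves as the single representative per cluster, and every block is held in an ordinary list rather than read from disk. All function names are hypothetical.

```python
import math


def kmeans(points, k, iters=50):
    """Plain k-means seeded with the first k points; returns (labels, centers)."""
    centers = [tuple(points[i]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


def two_level(points, p, k):
    """Cluster p blocks separately, cluster the p*k representatives, then relabel."""
    size = math.ceil(len(points) / p)
    blocks = [points[i:i + size] for i in range(0, len(points), size)]
    reps, owner = [], []  # owner[i]: index in reps of the representative of pattern i
    for block in blocks:
        labels, centers = kmeans(block, k)
        owner.extend(len(reps) + lab for lab in labels)
        reps.extend(centers)
    rep_labels, _ = kmeans(reps, k)  # second level: cluster the representatives
    return [rep_labels[o] for o in owner]


# Hypothetical patterns arranged so that each block contains both groups.
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10), (1, 1), (11, 11)]
final_labels = two_level(points, p=2, k=2)
print(final_labels)
```

Only one block of n/p patterns, plus the pk representatives, needs to be in main memory at a time.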

A two-level strategy for clustering a 
data set containing 2,000 patterns was 
described in Stahl [1986]. In the first 
level, the data set is loosely clustered 
into a large number of clusters using 
the leader algorithm. Representatives 
from these clusters, one per cluster, are 
the input to the second level clustering, 
which is obtained using Ward's hierar- 
chical method. 

Table II. Number of Distance Computations (n)
for the Single-Link Clustering Algorithm and a
Two-Level Divide and Conquer Algorithm

n         Single-link     Two-level
100       4,950           1,200
500       124,750         10,750
1,000     499,500         31,500
10,000    49,995,000      1,013,750

p: 2, 4, 10
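
The single-link column of Table II is consistent with computing every pairwise distance exactly once, that is, n(n − 1)/2 distances for n patterns; in particular, the entry 499,500 corresponds to n = 1,000. A short check, with a function name of our own choosing:

```python
def pairwise_distances(n):
    """Number of distinct pairs among n patterns."""
    return n * (n - 1) // 2


counts = {n: pairwise_distances(n) for n in (100, 500, 1000, 10000)}
for n, count in counts.items():
    print(n, count)
```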



5.12.2 Incremental Clustering. In- 
cremental clustering is based on the 
assumption that it is possible to con- 
sider patterns one at a time and assign 
them to existing clusters. Here, a new 
data item is assigned to a cluster with-
out affecting the existing clusters signif- 
icantly. A high level description of a 
typical incremental clustering algo- 
rithm is given below. 

An Incremental Clustering Algo- 
rithm 

(1) Assign the first data item to a clus- 
ter. 

(2) Consider the next data item. Either 
assign this item to one of the exist- 
ing clusters or assign it to a new 
cluster. This assignment is done 
based on some criterion, e.g., the dis-
tance between the new item and the
existing cluster centroids. 

(3) Repeat step 2 till all the data items 
are clustered. 
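
The three steps above can be sketched as follows. This is a minimal illustration, assuming the criterion in step (2) is the Euclidean distance to the nearest cluster centroid compared against a user-chosen threshold; the threshold, the function name, and the toy data are hypothetical.

```python
import math


def incremental_cluster(items, threshold):
    """Single pass over the data; only centroids and counts are kept in memory."""
    centroids, counts, labels = [], [], []
    for x in items:
        # Find the nearest existing centroid that lies within the threshold.
        best, best_d = None, threshold
        for c, mu in enumerate(centroids):
            d = math.dist(x, mu)
            if d <= best_d:
                best, best_d = c, d
        if best is None:
            # No cluster is close enough: start a new one.
            centroids.append(tuple(x))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Update the running mean of the chosen cluster.
            counts[best] += 1
            n = counts[best]
            centroids[best] = tuple(m + (xi - m) / n for m, xi in zip(centroids[best], x))
            labels.append(best)
    return labels, centroids


# Hypothetical data items, presented one at a time.
items = [(0, 0), (0, 1), (10, 10), (1, 0), (10, 11)]
labels, centroids = incremental_cluster(items, threshold=3.0)
print(labels, centroids)
```

Each data item is examined once and then discarded, so the memory in use grows with the number of clusters rather than with the number of items.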

The major advantage of incremental
clustering algorithms is that it is not
necessary to store the entire pattern
matrix in memory, so their space
requirements are very small. They are
typically noniterative, so their time
requirements are also small. There are
several incremental clustering algorithms:

(1) The leader clustering algorithm
[Hartigan 1975] is the simplest in
terms of time complexity, which is
O(nk). It has gained popularity be-
cause of its neural network imple-
mentation, the ART network [Car-
penter and Grossberg 1990]. It is
very easy to implement as it re-
quires only O(k) space.

(2) The shortest spanning path (SSP)
algorithm [Slagle et al. 1975] was
originally proposed for data reorga-
nization and was successfully used
in automatic auditing of records
[Lee et al. 1978]. Here, the SSP algo-
rithm was used to cluster 2000 pat-
terns using 18 features. These clus-
ters are used to estimate missing
feature values in data items and to
identify erroneous feature values.

(3) The cobweb system [Fisher 1987] is 
an incremental conceptual cluster- 
ing algorithm. It has been success- 
fully used in engineering applica- 
tions [Fisher et al. 1993]. 

(4) An incremental clustering algorithm 
for dynamic information processing 
was presented in Can [1993]. The 
motivation behind this work is that,
in dynamic databases, items might
get added and deleted over time.
These changes should be reflected in 
the partition generated without sig- 
nificantly affecting the current clus- 
ters. This algorithm was used to 
cluster incrementally an INSPEC 
database of 12,684 documents corre- 
sponding to computer science and 
electrical engineering. 

Order-independence is an important 
property of clustering algorithms. An 
algorithm is order-independent if it gen- 
erates the same partition for any order 
in which the data is presented. Other- 
wise, it is order-dependent. Most of the 
incremental algorithms presented above 
are order-dependent. We illustrate this 
order-dependent property in Figure 24 
where there are 6 two-dimensional ob- 
jects labeled 1 to 6. If we present these 
patterns to the leader algorithm in the 
order 2,1,3,5,4,6 then the two clusters 
obtained are shown by ellipses. If the 
order is 1,2,6,4,5,3, then we get a two- 
partition as shown by the triangles. The 
SSP algorithm, cobweb, and the algo- 
rithm in Can [1993] are all order-depen- 
dent. 
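
Order dependence is easy to reproduce with a small sketch of the leader algorithm. The version below assigns each pattern to the first leader within a distance threshold and otherwise makes it a new leader; the one-dimensional values and the threshold are hypothetical and are not the patterns of Figure 24.

```python
def leader(values, threshold):
    """Leader algorithm on 1-D data: join the first leader within the threshold."""
    leaders, clusters = [], []
    for v in values:
        for i, lead in enumerate(leaders):
            if abs(v - lead) <= threshold:
                clusters[i].append(v)
                break
        else:
            # No leader is close enough, so v becomes a new leader.
            leaders.append(v)
            clusters.append([v])
    return clusters


def as_partition(clusters):
    return {frozenset(c) for c in clusters}


first = leader([1, 2, 3, 4], threshold=1.5)
second = leader([2, 1, 3, 4], threshold=1.5)  # same patterns, different order
print(first, second)
```

The two presentation orders select different leaders and therefore produce different partitions of the same four patterns.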

Figure 24. The leader algorithm is order
dependent.

5.12.3 Parallel Implementation. Re-
cent work [Judd et al. 1996] demon-
strates that a combination of algorith-
mic enhancements to a clustering
algorithm and distribution of the com-
putations over a network of worksta-
tions can allow an entire 512 × 512
image to be clustered in a few minutes.
Depending on the clustering algorithm 
in use, parallelization of the code and 
replication of data for efficiency may 
yield large benefits. However, a global 
shared data structure, namely the clus- 
ter membership table, remains and 
must be managed centrally or replicated 
and synchronized periodically. The 
presence or absence of robust, efficient 
parallel clustering techniques will de- 
termine the success or failure of cluster 
analysis in large-scale data mining ap- 
plications in the future. 

6. APPLICATIONS 

Clustering algorithms have been used 
in a large variety of applications [Jain 
and Dubes 1988; Rasmussen 1992; 
Oehler and Gray 1995; Fisher et al. 
1993]. In this section, we describe sev- 
eral applications where clustering has 
been employed as an essential step. 
These areas are: (1) image segmenta- 
tion, (2) object and character recogni- 
tion, (3) document retrieval, and (4) 
data mining. 

6.1 Image Segmentation Using Clustering 

Image segmentation is a fundamental
component in many computer vision
applications, and can be addressed as a
clustering problem [Rosenfeld and Kak
1982]. The segmentation of the image(s)
presented to an image analysis system
is critically dependent on the scene to
be sensed, the imaging geometry, con-
figuration, and sensor used to transduce
the scene into a digital image, and ulti-
mately the desired output (goal) of the
system.

Figure 25. Feature representation for clustering. Image measurements and positions are transformed
to features. Clusters in feature space correspond to image segments.

The applicability of clustering meth- 
odology to the image segmentation 
problem was recognized over three de- 
cades ago, and the paradigms underly- 
ing the initial pioneering efforts are still 
in use today. A recurring theme is to 
define feature vectors at every image 
location (pixel) composed of both func- 
tions of image intensity and functions of 
the pixel location itself. This basic idea, 
depicted in Figure 25, has been success- 
fully used for intensity images (with or 
without texture), range (depth) images 
and multispectral images. 
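
As a minimal sketch of this recurring theme, the code below turns an intensity image into one feature vector per pixel, combining the intensity with the pixel's row and column. The `spatial_weight` parameter, which sets the influence of location relative to intensity, is our own hypothetical addition, as are the function name and the toy image.

```python
def pixel_features(image, spatial_weight=1.0):
    """One (intensity, weighted row, weighted column) feature vector per pixel."""
    features = []
    for i, row in enumerate(image):
        for j, intensity in enumerate(row):
            features.append((float(intensity), spatial_weight * i, spatial_weight * j))
    return features


# Hypothetical 2 x 2 intensity image.
image = [[10, 12], [200, 198]]
feats = pixel_features(image, spatial_weight=0.5)
print(feats)
```

Any clustering algorithm can then be applied to `feats`; clusters in this feature space correspond to candidate image segments.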

6.1.1 Segmentation. An image seg- 
mentation is typically defined as an ex- 
haustive partitioning of an input image 
into regions, each of which is considered 
to be homogeneous with respect to some 
image property of interest (e.g., inten- 
sity, color, or texture) [Jain et al. 1995]. 
If

$\mathcal{X} = \{x_{ij},\ i = 1 \ldots N_r,\ j = 1 \ldots N_c\}$

is the input image with $N_r$ rows and $N_c$
columns and measurement value $x_{ij}$ at
pixel $(i, j)$, then the segmentation can
be expressed as $\mathcal{S} = \{S_1, \ldots, S_k\}$, with
the $l$th segment

$S_l = \{(i_{l_1}, j_{l_1}), \ldots, (i_{l_{N_l}}, j_{l_{N_l}})\}$

consisting of a connected subset of the
pixel coordinates. No two segments
share any pixel locations ($S_i \cap S_j = \emptyset\ \forall i \neq j$),
and the union of all segments
covers the entire image
($\bigcup_{i=1}^{k} S_i = \{1 \ldots N_r\} \times \{1 \ldots N_c\}$). Jain and
Dubes [1988], after Fu and Mui [1981],
identified three techniques for produc-
ing segmentations from input imagery:
region-based, edge-based, or cluster-
based.
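
The conditions in this definition can be checked mechanically. The sketch below tests that a candidate segmentation consists of pairwise disjoint, connected segments whose union is the whole pixel grid. It assumes 4-connectivity and 1-based pixel coordinates; both are our own choices, and the function names are hypothetical.

```python
def is_connected(segment):
    """True if the pixels form a single 4-connected region."""
    pixels = set(segment)
    if not pixels:
        return False
    start = next(iter(pixels))
    stack, seen = [start], {start}
    while stack:
        i, j = stack.pop()
        for nb in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
            if nb in pixels and nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return seen == pixels


def is_valid_segmentation(segments, n_rows, n_cols):
    """Disjoint, connected segments that together cover the entire image."""
    all_pixels = {(i, j) for i in range(1, n_rows + 1) for j in range(1, n_cols + 1)}
    covered = set()
    for seg in segments:
        seg = set(seg)
        if covered & seg or not is_connected(seg):
            return False
        covered |= seg
    return covered == all_pixels


rows_split = [{(1, 1), (1, 2)}, {(2, 1), (2, 2)}]      # two horizontal strips
diagonal_split = [{(1, 1), (2, 2)}, {(1, 2), (2, 1)}]  # covers the grid, not connected
print(is_valid_segmentation(rows_split, 2, 2), is_valid_segmentation(diagonal_split, 2, 2))
```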

Consider the use of simple gray level 
thresholding to segment a high-contrast 
intensity image. Figure 26(a) shows a 
grayscale image of a textbook's bar code 
scanned on a flatbed scanner. Part (c)
shows the results of a simple threshold- 
ing operation designed to separate the 
dark and light regions in the bar code 
area. Binarization steps like this are 
often performed in character recogni-
tion systems. Thresholding in effect
'clusters' the image pixels into two
groups based on the one-dimensional
intensity measurement [Rosenfeld 1969;
Dunn et al. 1974]. A postprocessing step
separates the classes into connected re-
gions. While simple gray level thresh-
olding is adequate in some carefully
controlled image acquisition environ-
ments and much research has been de-
voted to appropriate methods for
thresholding [Weszka 1978; Trier and
Jain 1995], complex images require
more elaborate segmentation tech-
niques.

Figure 26. Binarization via thresholding. (a): Original grayscale image. (b): Gray-level histogram. (c):
Results of thresholding.
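
The view of thresholding as two-group clustering of intensities can be sketched as follows. The threshold selection rule used here, which repeatedly places the threshold midway between the means of the dark and light groups, is our own illustrative assumption and is not one of the methods cited above; the function names and the toy image are hypothetical.

```python
def two_means_threshold(pixels, tol=1e-6, max_iter=100):
    """Place the threshold midway between the means of the two intensity groups."""
    t = sum(pixels) / len(pixels)  # start from the overall mean
    for _ in range(max_iter):
        low = [v for v in pixels if v <= t]
        high = [v for v in pixels if v > t]
        if not low or not high:
            break  # all pixels fall in one group; keep the current threshold
        new_t = (sum(low) / len(low) + sum(high) / len(high)) / 2.0
        if abs(new_t - t) < tol:
            t = new_t
            break
        t = new_t
    return t


def binarize(image, t):
    """Label each pixel 1 (light) or 0 (dark) relative to the threshold."""
    return [[1 if v > t else 0 for v in row] for row in image]


# Hypothetical high-contrast image.
image = [[10, 12, 11], [200, 198, 205]]
t = two_means_threshold([v for row in image for v in row])
binary = binarize(image, t)
print(t, binary)
```

A connected-components pass over `binary` would then play the role of the postprocessing step that separates the two classes into connected regions.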

Many segmenters use measurements 
which are both spectral (e.g., the multi- 
spectral scanner used in remote sens- 
ing) and spatial (based on the pixel's 
location in the image plane). The mea- 
surement at each pixel hence corre- 
sponds directly to our concept of a pat- 
tern. 



6.1.2 Image Segmentation Via Clus- 
tering. The application of local feature 
clustering to segment gray-scale images 
was documented in Schachter et al. 
[1979]. This paper emphasized the ap- 
propriate selection of features at each 
pixel rather than the clustering method- 
ology, and proposed the use of image 
plane coordinates (spatial information) 
as additional features to be employed in 
clustering-based segmentation. The goal 
of clustering was to obtain a sequence of 
hyperellipsoidal clusters starting with 
cluster centers positioned at maximum 
density locations in the pattern space, 
and growing clusters about these cen-
ters until a $\chi^2$ test for goodness of fit
was violated. A variety of features were
discussed and applied to both grayscale
and color imagery.

An agglomerative clustering algo- 
rithm was applied in Silverman and 
Cooper [1988] to the problem of unsu- 
pervised learning of clusters of coeffi- 
cient vectors for two image models that 
correspond to image segments. The first 
image model is polynomial for the ob- 
served image measurements; the as- 
sumption here is that the image is a 
collection of several adjoining graph 
surfaces, each a polynomial function of 
the image plane coordinates, which are 
sampled on the raster grid to produce 
the observed image. The algorithm pro- 
ceeds by obtaining vectors of coefficients 
of least-squares fits to the data in M 
disjoint image windows. An agglomera- 
tive clustering algorithm merges (at 
each step) the two clusters that have a 
minimum global between-cluster Ma- 
halanobis distance. The same frame- 
work was applied to segmentation of 
textured images, but for such images 
the polynomial model was inappropri- 
ate, and a parameterized Markov Ran- 
dom Field model was assumed instead. 

Wu and Leahy [1993] describe the 
application of the principles of network 
flow to unsupervised classification, 
yielding a novel hierarchical algorithm 
for clustering. In essence, the technique 
views the unlabeled patterns as nodes 
in a graph, where the weight of an edge 
(i.e., its capacity) is a measure of simi- 
larity between the corresponding nodes. 
Clusters are identified by removing 
edges from the graph to produce con- 
nected disjoint subgraphs. In image seg- 
mentation, pixels which are 4-neighbors 
or 8-neighbors in the image plane share 
edges in the constructed adjacency 
graph, and the weight of a graph edge is 
based on the strength of a hypothesized 
image edge between the pixels involved 
(this strength is calculated using simple 
derivative masks). Hence, this seg- 
menter works by finding closed contours 
in the image, and is best labeled edge- 
based rather than region-based. 

In Vinod et al. [1994], two neural
networks are designed to perform pat- 
tern clustering when combined. A two- 
layer network operates on a multidi- 
mensional histogram of the data to 
identify 'prototypes' which are used to 
classify the input patterns into clusters. 
These prototypes are fed to the classifi- 
cation network, another two-layer net- 
work operating on the histogram of the 
input data, but are trained to have dif- 
fering weights from the prototype selec- 
tion network. In both networks, the his- 
togram of the image is used to weight 
the contributions of patterns neighbor- 
ing the one under consideration to the 
location of prototypes or the ultimate 
classification; as such, it is likely to be 
more robust when compared to tech- 
niques which assume an underlying 
parametric density function for the pat- 
tern classes. This architecture was 
tested on gray-scale and color segmen- 
tation problems. 

Jolion et al. [1991] describe a process
for extracting clusters sequentially from 
the input pattern set by identifying hy- 
perellipsoidal regions (bounded by loci 
of constant Mahalanobis distance) 
which contain a specified fraction of the 
unclassified points in the set. The ex- 
tracted regions are compared against 
the best-fitting multivariate Gaussian 
density through a Kolmogorov-Smirnov 
test, and the fit quality is used as a 
figure of merit for selecting the 'best' 
region at each iteration. The process 
continues until a stopping criterion is 
satisfied. This procedure was applied to 
the problems of threshold selection for 
multithreshold segmentation of inten- 
sity imagery and segmentation of range 
imagery. 

Clustering techniques have also been 
successfully used for the segmentation 
of range images, which are a popular 
source of input data for three-dimen- 
sional object recognition systems [Jain 
and Flynn 1993]. Range sensors typi- 
cally return raster images with the 
measured value at each pixel being the 
coordinates of a 3D location in space. 
These 3D positions can be understood 
as the locations where rays emerging
from the image plane locations in a bun- 
dle intersect the objects in front of the 
sensor. 

The local feature clustering concept is 
particularly attractive for range image 
segmentation since (unlike intensity 
measurements) the measurements at 
each pixel have the same units (length); 
this would make ad hoc transformations 
or normalizations of the image features 
unnecessary if their goal is to impose 
equal scaling on those features. How-
ever, range image segmenters often add
additional measurements to the feature 
space, removing this advantage. 

A range image segmentation system 
described in Hoffman and Jain [1987] 
employs squared error clustering in a 
six-dimensional feature space as a 
source of an "initial" segmentation 
which is refined (typically by merging 
segments) into the output segmenta- 
tion. The technique was enhanced in 
Flynn and Jain [1991] and used in a 
recent systematic comparison of range 
image segmenters [Hoover et al. 1996];
as such, it is probably one of the long- 
est-lived range segmenters which has 
performed well on a large variety of 
range images. 

This segmenter works as follows. At 
each pixel $(i, j)$ in the input range image,
the corresponding 3D measurement
is denoted $(x_{ij}, y_{ij}, z_{ij})$, where typically
$x_{ij}$ is a linear function of $j$ (the column
number) and $y_{ij}$ is a linear function of $i$
(the row number). A $k \times k$ neighborhood
of $(i, j)$ is used to estimate the 3D
surface normal $n_{ij} = (n^x_{ij}, n^y_{ij}, n^z_{ij})$ at
$(i, j)$, typically by finding the least-squares
planar fit to the 3D points in
the neighborhood. The feature vector for
the pixel at $(i, j)$ is the six-dimensional
measurement $(x_{ij}, y_{ij}, z_{ij}, n^x_{ij}, n^y_{ij}, n^z_{ij})$,
and a candidate segmentation is found 
by clustering these feature vectors. For 
practical reasons, not every pixel's fea- 
ture vector is used in the clustering 
procedure; typically 1000 feature vec- 
tors are chosen by subsampling. 
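
The feature construction described above can be sketched as follows. This is a minimal illustration, not the implementation of Hoffman and Jain [1987]: the function names are hypothetical, the plane is fitted in the form z = a*x + b*y + c (which assumes the local surface is not vertical), and pixels whose neighborhood falls outside the image are simply skipped.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def plane_normal(points):
    """Unit normal of the least-squares plane z = a*x + b*y + c (assumed form)."""
    n = len(points)
    sx = sum(x for x, _, _ in points)
    sy = sum(y for _, y, _ in points)
    sz = sum(z for _, _, z in points)
    sxx = sum(x * x for x, _, _ in points)
    sxy = sum(x * y for x, y, _ in points)
    syy = sum(y * y for _, y, _ in points)
    sxz = sum(x * z for x, _, z in points)
    syz = sum(y * z for _, y, z in points)
    # Normal equations A [a, b, c]^T = rhs, solved for a and b by Cramer's rule.
    A = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    rhs = [sxz, syz, sz]
    d = det3(A)
    a = det3([[rhs[r] if c == 0 else A[r][c] for c in range(3)] for r in range(3)]) / d
    b = det3([[rhs[r] if c == 1 else A[r][c] for c in range(3)] for r in range(3)]) / d
    length = math.sqrt(a * a + b * b + 1.0)
    return (-a / length, -b / length, 1.0 / length)

def pixel_features(xyz, k=3):
    """Map (i, j) -> (x, y, z, nx, ny, nz) for pixels with a full k x k neighborhood."""
    half = k // 2
    rows, cols = len(xyz), len(xyz[0])
    features = {}
    for i in range(half, rows - half):
        for j in range(half, cols - half):
            hood = [xyz[r][c]
                    for r in range(i - half, i + half + 1)
                    for c in range(j - half, j + half + 1)]
            features[(i, j)] = tuple(xyz[i][j]) + plane_normal(hood)
    return features
```

A subsample of these vectors (the text mentions about 1000) would then be passed to the clustering step.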



The CLUSTER algorithm [Jain and 
Dubes 1988] was used to obtain seg- 
ment labels for each pixel. CLUSTER is 
an enhancement of the k-means algorithm;
it has the ability to identify several
clusterings of a data set, each with
a different number of clusters. Hoffman 
and Jain [1987] also experimented with 
other clustering techniques (e.g., com- 
plete-link, single-link, graph-theoretic, 
and other squared error algorithms) and 
found CLUSTER to provide the best 
combination of performance and accu- 
racy. An additional advantage of CLUS- 
TER is that it produces a sequence of 
output clusterings (i.e., a 2-cluster solution
up through a $K_{max}$-cluster solution
where $K_{max}$ is specified by the user and
is typically 20 or so); each clustering in 
this sequence yields a clustering statis- 
tic which combines between-cluster sep- 
aration and within-cluster scatter. The 
clustering that optimizes this statistic 
is chosen as the best one. Each pixel in 
the range image is assigned the seg- 
ment label of the nearest cluster center. 
This minimum distance classification 
step is not guaranteed to produce seg- 
ments which are connected in the image 
plane; therefore, a connected components
labeling algorithm allocates new
labels for disjoint regions that were 
placed in the same cluster. Subsequent 
operations include surface type tests, 
merging of adjacent patches using a test 
for the presence of crease or jump edges 
between adjacent segments, and surface 
parameter estimation. 
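
The two labeling steps above (minimum distance classification followed by connected components relabeling) can be sketched as below. This is an illustrative sketch under stated assumptions: squared Euclidean distance to the cluster centers, 4-connectivity in the image plane, and hypothetical function names. The later surface tests and merging steps are not shown.

```python
from collections import deque

def nearest_center(feature, centers):
    """Index of the cluster center closest to the feature vector."""
    return min(range(len(centers)),
               key=lambda c: sum((f - g) ** 2 for f, g in zip(feature, centers[c])))

def relabel_connected(labels):
    """Give every 4-connected region of equal cluster label its own new label."""
    rows, cols = len(labels), len(labels[0])
    out = [[-1] * cols for _ in range(rows)]
    next_label = 0
    for i in range(rows):
        for j in range(cols):
            if out[i][j] != -1:
                continue  # already part of a labeled region
            out[i][j] = next_label
            queue = deque([(i, j)])
            while queue:
                r, c = queue.popleft()
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if (0 <= rr < rows and 0 <= cc < cols
                            and out[rr][cc] == -1
                            and labels[rr][cc] == labels[i][j]):
                        out[rr][cc] = next_label
                        queue.append((rr, cc))
            next_label += 1
    return out
```

In the example `[[0, 1, 0], [0, 1, 0]]`, the two columns labeled 0 belong to the same cluster but are disjoint in the image, so they receive different segment labels.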

Figure 27 shows this processing ap- 
plied to a range image. Part a of the 
figure shows the input range image; 
part b shows the distribution of surface 
normals. In part c, the initial segmenta- 
tion returned by CLUSTER and modi- 
fied to guarantee connected segments is 
shown. Part d shows the final segmen- 
tation produced by merging adjacent 
patches which do not have a significant 
crease edge between them. The final 
clusters reasonably represent distinct 
surfaces present in this complex object. 

Figure 27. Range image segmentation using clustering. (a): Input range image. (b): Surface normals
for selected image pixels. (c): Initial segmentation (19 cluster solution) returned by CLUSTER using
1000 six-dimensional samples from the image as a pattern set. (d): Final segmentation (8 segments)
produced by postprocessing.



The analysis of textured images has 
been of interest to researchers for sev- 
eral years. Texture segmentation tech- 
niques have been developed using a va- 
riety of texture models and image 
operations. In Nguyen and Cohen 
[1993], texture image segmentation was 
addressed by modeling the image as a 
hierarchy of two Markov Random 
Fields, obtaining some simple statistics 
from each image block to form a feature 
vector, and clustering these blocks using
a fuzzy K-means clustering method.
The clustering procedure here is modi- 
fied to jointly estimate the number of 



clusters as well as the fuzzy member- 
ship of each feature vector to the vari- 
ous clusters. 

A system for segmenting texture im- 
ages was described in Jain and Far- 
rokhnia [1991]; there, Gabor filters 
were used to obtain a set of 28 orienta- 
tion- and scale-selective features that 
characterize the texture in the neigh- 
borhood of each pixel. These 28 features 
are reduced to a smaller number 
through a feature selection procedure, 
and the resulting features are prepro- 
cessed and then clustered using the 
CLUSTER program. An index statistic
[Dubes 1987] is used to select the best
clustering. Minimum distance classification
is used to label each of the original
image pixels. This technique was
tested on several texture mosaics including
the natural Brodatz textures
and synthetic images. Figure 28(a)
shows an input texture mosaic consisting
of four of the popular Brodatz textures
[Brodatz 1966]. Part b shows the
segmentation produced when the Gabor
filter features are augmented to contain
spatial information (pixel coordinates).
This Gabor filter based technique has
proven very powerful and has been extended
to the automatic segmentation of
text in documents [Jain and Bhattacharjee
1992] and segmentation of objects
in complex backgrounds [Jain et
al. 1997].

Figure 28. Texture image segmentation results. (a): Four-class texture mosaic. (b): Four-cluster
solution produced by CLUSTER with pixel coordinates included in the feature set.

Clustering can be used as a preprocessing
stage to identify pattern classes
for subsequent supervised classification.
Taxt and Lundervold [1994] and
Lundervold et al. [1996] describe a partitional
clustering algorithm and a manual
labeling technique to identify material
classes (e.g., cerebrospinal fluid,
white matter, striated muscle, tumor) in
registered images of a human head obtained
at five different magnetic resonance
imaging channels (yielding a five-
dimensional feature vector at each 
pixel). A number of clusterings were 
obtained and combined with domain 
knowledge (human expertise) to identify 
the different classes. Decision rules for 
supervised classification were based on 
these obtained classes. Figure 29(a) 
shows one channel of an input multi- 
spectral image; part b shows the 9-clus- 
ter result. 

The k-means algorithm was applied
to the segmentation of LANDSAT imag- 
ery in Solberg et al. [1996]. Initial clus- 
ter centers were chosen interactively by 
a trained operator, and correspond to 
land-use classes such as urban areas, 
soil (vegetation-free) areas, forest, 
grassland, and water. Figure 30(a) 
shows the input image rendered as 
grayscale; part b shows the result of the 
clustering procedure. 

6.1.3 Summary. In this section, the 
application of clustering methodology to 
image segmentation problems has been 
motivated and surveyed. The historical 
record shows that clustering is a power- 
ful tool for obtaining classifications of 
image pixels. Key issues in the design of 
any clustering-based segmenter are the 
choice of pixel measurements (features)
and dimensionality of the feature vector 
(i.e., should the feature vector contain 
intensities, pixel positions, model pa- 
rameters, filter outputs?), a measure of 
similarity which is appropriate for the 
selected features and the application do- 
main, the identification of a clustering 
algorithm, the development of strate- 
gies for feature and data reduction (to 
avoid the "curse of dimensionality" and 
the computational burden of classifying 



large numbers of patterns and/or fea- 
tures), and the identification of neces- 
sary pre- and post-processing tech- 
niques (e.g., image smoothing and 
minimum distance classification). The 
use of clustering for segmentation dates 
back to the 1960s, and new variations 
continue to emerge in the literature. 
Challenges to the more successful use of 
clustering include the high computa- 
tional complexity of many clustering al- 
gorithms and their incorporation of 
strong assumptions (often multivariate
Gaussian) about the multidimensional 
shape of clusters to be obtained. The 
ability of new clustering procedures to 
handle concepts and semantics in classi- 
fication (in addition to numerical mea- 
surements) will be important for certain 
applications [Michalski and Stepp 1983; 
Murty and Jain 1995]. 

6.2 Object and Character Recognition 

6.2.1 Object Recognition. The use of 
clustering to group views of 3D objects 
for the purposes of object recognition in 
range data was described in Dorai and 
Jain [1995]. The term view refers to a
range image of an unoccluded object 
obtained from any arbitrary viewpoint. 
The system under consideration employed
a viewpoint dependent (or view-centered)
approach to the object recognition
problem; each object to be
recognized was represented in terms of 
a library of range images of that object. 

There are many possible views of a 3D 
object and one goal of that work was to 
avoid matching an unknown input view 
against each image of each object. A 
common theme in the object recognition 
literature is indexing, wherein the un- 
known view is used to select a subset of 
views of a subset of the objects in the 
database for further comparison, and 
rejects all other views of objects. One of 
the approaches to indexing employs the 
notion of view classes; a view class is the 
set of qualitatively similar views of an 
object. In that work, the view classes 
were identified by clustering; the rest of 
this subsection outlines the technique. 

Object views were grouped into 
classes based on the similarity of shape 
spectral features. Each input image of 
an object viewed in isolation yields a 
feature vector which characterizes that 
view. The feature vector contains the 
first ten central moments of a normal- 
ized shape spectral distribution, H(h), 
of an object view. The shape spectrum of 
an object view is obtained from its range 
data by constructing a histogram of 



shape index values (which are related to 
surface curvature values) and accumu- 
lating all the object pixels that fall into 
each bin. By normalizing the spectrum 
with respect to the total object area, the 
scale (size) differences that may exist 
between different objects are removed. 
The first moment $m_1$ is computed as the
weighted mean of $H(h)$:

$$m_1 = \sum_h h \, H(h). \qquad (1)$$

The other central moments, $m_p$, $2 \le p \le 10$,
are defined as:

$$m_p = \sum_h (h - m_1)^p H(h). \qquad (2)$$

Then, the feature vector is denoted as
$R = (m_1, m_2, \ldots, m_{10})$, with the
range of each of these moments being
$[-1, 1]$.
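
As a small illustration of Equations (1) and (2), the sketch below computes the moment vector from a histogram of shape index values. The function name and the representation of the histogram (a list of bin centers and a list of pixel counts) are assumptions made for this example, and dividing by the total count stands in for normalization by object area.

```python
def shape_spectrum_moments(bin_centers, counts, num_moments=10):
    """Return [m_1, ..., m_num_moments] for a shape-index histogram."""
    total = sum(counts)
    # Normalized spectrum H(h): fraction of object pixels in each bin.
    H = [c / total for c in counts]
    # Equation (1): weighted mean of the spectrum.
    m1 = sum(h * w for h, w in zip(bin_centers, H))
    moments = [m1]
    # Equation (2): central moments of order 2 through num_moments.
    for p in range(2, num_moments + 1):
        moments.append(sum((h - m1) ** p * w for h, w in zip(bin_centers, H)))
    return moments
```

For a two-bin spectrum with centers 0.0 and 1.0 and equal counts, the first moment is 0.5, the second is 0.25, and the third is 0 by symmetry.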

Let $\mathcal{O} = \{O^1, O^2, \ldots, O^n\}$ be a collection
of $n$ 3D objects whose views are
present in the model database, $\mathcal{M}_D$. The
$i$th view of the $j$th object, $O^j_i$, in the
database is represented by $\langle L^j_i, R^j_i \rangle$,
where $L^j_i$ is the object label and $R^j_i$ is the
feature vector. Given a set of object
representations $\mathcal{R}^i = \{\langle L^i_1, R^i_1 \rangle, \ldots, \langle L^i_m, R^i_m \rangle\}$
that describes $m$ views of the
$i$th object, the goal is to derive a partition
of the views, $\mathcal{P}^i = \{C^i_1, C^i_2, \ldots, C^i_{k_i}\}$.
Each cluster in $\mathcal{P}^i$ contains
those views of the $i$th object that
have been adjudged similar based on
the dissimilarity between the corresponding
moment features of the shape
spectra of the views. The measure of
dissimilarity between $R^i_j$ and $R^i_k$ is defined
as:

$$\mathcal{D}(R^i_j, R^i_k) = \sum_{l=1}^{10} (R^i_{jl} - R^i_{kl})^2. \qquad (3)$$

6.2.2 Clustering Views. A database 
containing 3,200 range images of 10 dif- 
ferent sculpted objects with 320 views 
per object is used [Dorai and Jain 1995]. 
The range images from 320 possible
viewpoints (determined by the tessella- 
tion of the view-sphere using the icosa- 
hedron) of the objects were synthesized. 
Figure 31 shows a subset of the collec- 
tion of views of Cobra used in the exper- 
iment. 

The shape spectrum of each view is 
computed and then its feature vector is 
determined. The views of each object 
are clustered, based on the dissimilarity 
measure $\mathcal{D}$ between their moment vectors
using the complete-link hierarchical
clustering scheme [Jain and Dubes
1988]. The hierarchical grouping ob- 
tained with 320 views of the Cobra ob- 



ject is shown in Figure 32. The view 
grouping hierarchies of the other nine 
objects are similar to the dendrogram in 
Figure 32. This dendrogram is cut at a 
dissimilarity level of 0.1 or less to ob- 
tain compact and well-separated clus- 
ters. The clusterings obtained in this 
manner demonstrate that the views of 
each object fall into several distinguish- 
able clusters. The centroid of each of 
these clusters was determined by com- 
puting the mean of the moment vectors 
of the views falling into the cluster. 
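
The view-grouping procedure can be sketched as a naive complete-link agglomeration that stops merging once the next merge would exceed the cut level. This is a sketch under assumptions, not the code of Dorai and Jain [1995]: the dissimilarity is taken to be the sum of squared differences between corresponding moments, the function names are hypothetical, and the quadratic search over cluster pairs is chosen for clarity rather than speed.

```python
def dissimilarity(r1, r2):
    """Assumed measure: sum of squared differences between moment vectors."""
    return sum((a - b) ** 2 for a, b in zip(r1, r2))

def complete_link(vectors, cut):
    """Merge clusters until the smallest complete-link distance exceeds cut."""
    clusters = [[i] for i in range(len(vectors))]

    def link(c1, c2):
        # Complete link: distance between the two farthest members.
        return max(dissimilarity(vectors[a], vectors[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = link(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > cut:
            break  # equivalent to cutting the dendrogram at this level
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

def centroid(vectors, members):
    """Mean of the moment vectors of the views in one cluster."""
    dim = len(vectors[0])
    return [sum(vectors[m][d] for m in members) / len(members) for d in range(dim)]
```

With `cut=0.1`, this mirrors the dendrogram cut described above; each resulting cluster is then summarized by its centroid.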

Dorai and Jain [1995] demonstrated 
that this clustering-based view grouping 
procedure facilitates object matching 
in terms of classification accuracy and
the number of matches necessary for 
correct classification of test views. Ob- 
ject views are grouped into compact and 
homogeneous view clusters, thus dem- 
onstrating the power of the cluster- 
based scheme for view organization and 
efficient object matching. 

6.2.3 Character Recognition. Clustering
was employed in Connell and
Jain [1998] to identify lexemes in hand- 
written text for the purposes of writer- 
independent handwriting recognition. 
The success of a handwriting recogni- 
tion system is vitally dependent on its 
acceptance by potential users. Writer-dependent
systems provide a higher
level of recognition accuracy than writer-independent
systems, but require a
large amount of training data. A writer- 



independent system, on the other hand, 
must be able to recognize a wide variety 
of writing styles in order to satisfy an 
individual user. As the variability of the 
writing styles that must be captured by 
a system increases, it becomes more and 
more difficult to discriminate between 
different classes due to the amount of 
overlap in the feature space. One solu- 
tion to this problem is to separate the 
data from these disparate writing styles 
for each class into different subclasses, 
known as lexemes. These lexemes repre- 
sent portions of the data which are more 
easily separated from the data of classes 
other than that to which the lexeme 
belongs. 

In this system, handwriting is cap- 
tured by digitizing the (x, y) position of 
the pen and the state of the pen point 
(up or down) at a constant sampling
rate. Following some resampling, nor- 
malization, and smoothing, each stroke 
of the pen is represented as a variable- 
length string of points. A metric based 
on elastic template matching and dy- 
namic programming is defined to allow 
the distance between two strokes to be 
calculated. 
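
To give a feel for a distance computed by dynamic programming over variable-length point strings, the sketch below uses classic dynamic time warping. This is a stand-in chosen for illustration, not the elastic-matching metric defined by Connell and Jain [1998]; the function name is hypothetical, strokes are assumed to be non-empty lists of (x, y) pairs, and plain dynamic time warping does not satisfy the triangle inequality.

```python
import math

def stroke_distance(a, b):
    """Dynamic time warping cost between two strokes (lists of (x, y) points)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: best alignment cost of the first i points of a
    # with the first j points of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.hypot(a[i - 1][0] - b[j - 1][0], a[i - 1][1] - b[j - 1][1])
            # A point may match one point, or be stretched over several.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Because points may be stretched, a stroke and a copy of it with a repeated sample are at distance zero, which is the "elastic" behavior the matching relies on.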

Using the distances calculated in this 
manner, a proximity matrix is con- 
structed for each class of digits (i.e., 0 
through 9). Each matrix measures the 
intraclass distances for a particular 
digit class. Digits in a particular class 
are clustered in an attempt to find a 
small number of prototypes. Clustering 
is done using the CLUSTER program 
described above [Jain and Dubes 1988], 
in which the feature vector for a digit is 
its N proximities to the digits of the 
same class. CLUSTER attempts to pro- 
duce the best clustering for each value 
of K over some range, where K is the 
number of clusters into which the data 
is to be partitioned. As expected, the 
mean squared error (MSE) decreases 
monotonically as a function of K. The 
"optimal" value of K is chosen by identi- 
fying a "knee" in the plot of MSE vs. K. 
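
One simple way to locate a knee automatically is to pick the K at which the decrease in MSE slows down the most. The sketch below is only a heuristic under that assumption (the source does not say how the knee was identified); the function name is hypothetical and the K values are assumed to be equally spaced.

```python
def knee_of_curve(ks, mse):
    """Return the K with the largest drop in the rate of MSE decrease."""
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        # Improvement gained by reaching K, minus improvement gained after K.
        bend = (mse[i - 1] - mse[i]) - (mse[i] - mse[i + 1])
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k
```

For a curve such as MSE = 100, 60, 20, 18, 17 at K = 1 to 5, the gain flattens after K = 3, and that is the value selected.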

When representing a cluster of digits 
by a single prototype, the best on-line 
recognition results were obtained by us- 
ing the digit that is closest to that clus- 
ter's center. Using this scheme, a cor- 
rect recognition rate of 99.33% was 
obtained. 



6.3 Information Retrieval 

Information retrieval (IR) is concerned 
with automatic storage and retrieval of 
documents [Rasmussen 1992]. Many 
university libraries use IR systems to 
provide access to books, journals, and 
other documents. Libraries use the Li- 
brary of Congress Classification (LCC) 
scheme for efficient storage and re- 
trieval of books. The LCC scheme con- 
sists of classes labeled A to Z [LC Clas- 
sification Outline 1990] which are used 
to characterize books belonging to dif- 



ferent subjects. For example, label Q 
corresponds to books in the area of sci- 
ence, and the subclass QA is assigned to 
mathematics. Labels QA76 to QA76.8 
are used for classifying books related to 
computers and other areas of computer
science.

There are several problems associated 
with the classification of books using 
the LCC scheme. Some of these are 
listed below: 

(1) When a user is searching for books 
in a library which deal with a topic 
of interest to him, the LCC number 
alone may not be able to retrieve all 
the relevant books. This is because 
the classification number assigned 
to the books or the subject catego- 
ries that are typically entered in the 
database do not contain sufficient 
information regarding all the topics 
covered in a book. To illustrate this 
point, let us consider the book Algo- 
rithms for Clustering Data by Jain 
and Dubes [1988]. Its LCC number 
is 'QA 278.J35'. In this LCC number,
QA 278 corresponds to the topic
'cluster analysis', J corresponds to 
the first author's name and 35 is the
serial number assigned by the Li- 
brary of Congress. The subject cate- 
gories for this book provided by the 
publisher (which are typically en- 
tered in a database to facilitate 
search) are cluster analysis, data 
processing and algorithms. There is 
a chapter in this book [Jain and 
Dubes 1988] that deals with com- 
puter vision, image processing, and 
image segmentation. So a user look- 
ing for literature on computer vision 
and, in particular, image segmenta- 
tion will not be able to access this 
book by searching the database with 
the help of either the LCC number 
or the subject categories provided in 
the database. The LCC number for 
computer vision books is TA 1632 
[LC Classification 1990] which is 
very different from the number QA
278.J35 assigned to this book. 

(2) There is an inherent problem in assigning
LCC numbers to books in a
rapidly developing area. For exam- 
ple, let us consider the area of neu- 
ral networks. Initially, category 'QP' 
in LCC scheme was used to label 
books and conference proceedings in 
this area. For example, Proceedings 
of the International Joint Conference 
on Neural Networks [IJCNN'91] was 
assigned the number 'QP 363.3'. But 
most of the recent books on neural 
networks are given a number using 
the category label 'QA'; Proceedings 
of the IJCNN'92 [IJCNN'92] is as- 
signed the number 'QA 76.87'. Mul- 
tiple labels for books dealing with 
the same topic will force them to be 
placed on different stacks in a li- 
brary. Hence, there is a need to up- 
date the classification labels from 
time to time in an emerging discipline.

(3) Assigning a number to a new book is 
a difficult problem. A book may deal 
with topics corresponding to two or 
more LCC numbers, and therefore, 
assigning a unique number to such 
a book is difficult. 

Murty and Jain [1995] describe a 
knowledge-based clustering scheme to 
group representations of books, which 
are obtained using the ACM CR (Associ- 
ation for Computing Machinery Com- 
puting Reviews) classification tree 
[ACM CR Classifications 1994]. This 
tree is used by the authors contributing 
to various ACM publications to provide 
keywords in the form of ACM CR cate- 
gory labels. This tree consists of 11 
nodes at the first level. These nodes are 
labeled A to K. Each node in this tree 
has a label that is a string of one or 
more symbols. These symbols are alphanumeric
characters. For example, I515
is the label of a fourth-level node in the 
tree. 

6.3.1 Pattern Representation. Each 
book is represented as a generalized list 
[Sangal 1991] of these strings using the 
ACM CR classification tree. For the 



sake of brevity in representation, the 
fourth-level nodes in the ACM CR clas- 
sification tree are labeled using numer- 
als 1 to 9 and characters A to Z. For 
example, the children nodes of I.5.1
(models) are labeled I.5.1.1 to I.5.1.6.
Here, I.5.1.1 corresponds to the node
labeled deterministic, and I.5.1.6 stands
for the node labeled structural. In a
similar fashion, all the fourth-level
nodes in the tree can be labeled as necessary.
From now on, the dots in between
successive symbols will be omitted
to simplify the representation. For
example, I.5.1.1 will be denoted as I511.

We illustrate this process of represen- 
tation with the help of the book by Jain 
and Dubes [1988]. There are five chap- 
ters in this book. For simplicity of pro- 
cessing, we consider only the informa- 
tion in the chapter contents. There is a 
single entry in the table of contents for 
chapter 1, 'Introduction,' and so we do 
not extract any keywords from this. 
Chapter 2, labeled 'Data Representa- 
tion,' has section titles that correspond 
to the labels of the nodes in the ACM 
CR classification tree [ACM CR Classifi- 
cations, 1994] which are given below: 
(1a) I522 (feature evaluation and selection),
(2b) I532 (similarity measures), and
(3c) I515 (statistical).
Based on the above analysis, Chapter 2 of
Jain and Dubes [1988] can be characterized
by the weighted disjunction
$((I522 \vee I532 \vee I515)(1,4))$. The weights
(1,4) denote that it is one of the four chapters
which plays a role in the representation
of the book. Based on the table of
contents, we can use one or more of the
strings I522, I532, and I515 to represent
Chapter 2. In a similar manner, we can
represent other chapters in this book as
weighted disjunctions based on the table of
contents and the ACM CR classification
tree. The representation of the entire book,
the conjunction of all these chapter representations,
is given by
$(((I522 \vee I532 \vee I515)(1,4)) \wedge ((I515 \vee I531)(2,4)) \wedge ((I541 \vee I46 \vee I434)(1,4)))$.
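
One possible in-memory form of this representation is a list of (label set, weight) pairs, one pair per weighted disjunction, with the list as a whole standing for the conjunction. The sketch below is an assumption made for illustration: labels are written with the ACM CR letter I (for example I522 for I.5.2.2), a weight (k, 4) is read as the fraction k/4 of the contributing chapters, and the helper names are hypothetical.

```python
from fractions import Fraction

# Representation of Jain and Dubes [1988] as given in the text:
# each entry is (disjunction of node labels, weight).
book = [
    ({"I522", "I532", "I515"}, Fraction(1, 4)),
    ({"I515", "I531"}, Fraction(2, 4)),
    ({"I541", "I46", "I434"}, Fraction(1, 4)),
]

def strings_in(book):
    """All node labels that occur anywhere in the representation."""
    return set().union(*(labels for labels, _ in book))

def total_weight(book):
    """Sum of the disjunction weights (1 when all four chapters are covered)."""
    return sum(weight for _, weight in book)
```

Under this reading the three weights add up to 1, consistent with four contributing chapters.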

Currently, these representations are
generated manually by scanning the table
of contents of books in the computer
science area, as the ACM CR classification
tree provides knowledge of computer
science books only. The details of the
collection of books used in this study are 
available in Murty and Jain [1995]. 

6.3.2 Similarity Measure. The simi- 
larity between two books is based on the 
similarity between the corresponding 
strings. Two of the well-known distance 
functions between a pair of strings are 
[Baeza-Yates 1992] the Hamming distance
and the edit distance. Neither of
these two distance functions can be 
meaningfully used in this application. 
The following example illustrates the 
point. Consider three strings I242, I233,
and H242. These strings are labels
(predicate logic for knowledge representation,
logic programming, and distributed
database systems) of three fourth-level
nodes in the ACM CR
classification tree. Nodes I242 and I233
are the grandchildren of the node labeled
I2 (artificial intelligence) and
H242 is a grandchild of the node labeled
H2 (database management). So, the distance
between I242 and I233 should be
smaller than that between I242 and
H242. However, Hamming distance and
edit distance [Baeza-Yates 1992] both
have a value of 2 between I242 and I233
and a value of 1 between I242 and
H242. This limitation motivates the definition
of a new similarity measure that
correctly captures the similarity be- 
tween the above strings. The similarity 
between two strings is defined as the 
ratio of the length of the largest com- 
mon prefix [Murty and Jain 1995] be- 
tween the two strings to the length of 
the first string. For example, the similarity
between strings I522 and I51 is
0.5. The proposed similarity measure is
not symmetric because the similarity
between I51 and I522 is 0.67. The minimum
and maximum values of this similarity
measure are 0.0 and 1.0, respectively.
The knowledge of the
relationship between nodes in the ACM 



CR classification tree is captured by the 
representation in the form of strings. 
For example, the node labeled pattern recognition
is represented by the string I5,
whereas the string I53 corresponds to
the node labeled clustering. The similarity
between these two nodes (I5 and I53)
is 1.0. A symmetric measure of similarity
[Murty and Jain 1995] is used to
construct a similarity matrix of size $100
\times 100$ corresponding to 100 books used
in experiments. 
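
The prefix-based measure and the Hamming comparison from this subsection can be checked with a few lines of code. The function names are hypothetical, labels are written with the ACM CR letter I (for example I522), and only the asymmetric measure stated in the text is implemented; the symmetric variant of Murty and Jain [1995] is not reproduced here.

```python
def common_prefix_length(s, t):
    """Length of the longest common prefix of two strings."""
    n = 0
    for a, b in zip(s, t):
        if a != b:
            break
        n += 1
    return n

def prefix_similarity(first, second):
    """Common-prefix length divided by the length of the first string."""
    return common_prefix_length(first, second) / len(first)

def hamming(s, t):
    """Number of differing positions (strings assumed to have equal length)."""
    return sum(a != b for a, b in zip(s, t))
```

Hamming distance rates I242 as closer to H242 (distance 1) than to I233 (distance 2), whereas the prefix similarity gives 0.5 for the pair I242 and I233 and 0.0 for the pair I242 and H242, matching the intended ordering.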

6.3.3 An Algorithm for Clustering 
Books. The clustering problem can be 
stated as follows. Given a collection $\mathcal{B}$
of books, we need to obtain a set $\mathcal{C}$ of
clusters. A proximity dendrogram [Jain 
and Dubes 1988], using the complete- 
link agglomerative clustering algorithm 
for the collection of 100 books is shown 
in Figure 33. Seven clusters are ob- 
tained by choosing a threshold (t) value 
of 0.12. It is well known that different 
values for t might give different cluster- 
ings. This threshold value is chosen be- 
cause the "gap" in the dendrogram be- 
tween the levels at which six and seven 
clusters are formed is the largest. An 
examination of the subject areas of the 
books [Murty and Jain 1995] in these 
clusters revealed that the clusters obtained
are indeed meaningful. Each of
these clusters is represented using a
list of string $s$ and frequency $s_f$ pairs,
where $s_f$ is the number of books in the
cluster in which $s$ is present. For example,
cluster $C_1$ contains 43 books belonging
to pattern recognition, neural networks,
artificial intelligence, and
computer vision; a part of its representation
$\mathcal{R}(C_1)$ is given below.

$\mathcal{R}(C_1)$ = ((B718,1), (C12,1), (D0,2),
(D311,1), (D312,2), (D321,1),
(D322,1), (D329,1), ... (I46,3),
(I461,2), (I462,1), (I463,3),
... (J26,1), (J6,1),
(J61,7), (J71,1))

These clusters of books and the corresponding
cluster descriptions can be
used as follows: If a user is searching
for books, say, on image segmentation
(I46), then we select cluster $C_1$ because
its representation alone contains the
string I46. Books $B_2$ (Neurocomputing)
and $B_{18}$ (Sensory Neural Networks: Lateral
Inhibition) are both members of cluster
$C_1$ even though their LCC numbers
are quite different ($B_2$ is QA76.5.H4442,
$B_{18}$ is QP363.3.N33).
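
The lookup described here amounts to testing which cluster descriptions contain the query string. A minimal sketch, assuming each description is stored as a mapping from string to frequency: the entries for C1 are an excerpt of the pairs quoted in the text, while the second cluster and the function name are hypothetical and included only so the example has something to reject.

```python
# Excerpt of the C1 description from the text; C2 is a made-up second cluster.
descriptions = {
    "C1": {"I46": 3, "I461": 2, "I462": 1, "I463": 3, "J61": 7},
    "C2": {"H242": 1},
}

def clusters_containing(query, descriptions):
    """Names of the clusters whose description lists the query string."""
    return [name for name, pairs in descriptions.items() if query in pairs]
```

A query for I46 (image segmentation) selects only C1, independently of the LCC numbers of the books inside that cluster.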

Four additional books labeled $B_{101}$,
$B_{102}$, $B_{103}$, and $B_{104}$ have been used to
study the problem of assigning classification
numbers to new books. The LCC
numbers of these books are: ($B_{101}$)
Q335.T39, ($B_{102}$) QA76.73.P356C57,
($B_{103}$) QA76.5.B76C.2, and ($B_{104}$)
QA76.9D5W44. These books are assigned
to clusters based on nearest
neighbor classification. The nearest
neighbor of $B_{101}$, a book on artificial
intelligence, is $B_{23}$ and so $B_{101}$ is assigned
to cluster $C_1$. It is observed that
the assignment of these four books to 
the respective clusters is meaningful, 
demonstrating that knowledge-based 
clustering is useful in solving problems 
associated with document retrieval. 

6.4 Data Mining 

In recent years we have seen ever in- 
creasing volumes of collected data of all 
sorts. With so much data available, it is 
necessary to develop algorithms which 
can extract meaningful information 
from the vast stores. Searching for use- 
ful nuggets of information among huge 
amounts of data has become known as 
the field of data mining. 

Data mining can be applied to rela- 
tional, transaction, and spatial data- 
bases, as well as large stores of unstruc- 
tured data such as the World Wide Web. 
There are many data mining systems in 
use today, and applications include the 
U.S. Treasury detecting money launder- 
ing, National Basketball Association 



coaches detecting trends and patterns of 
play for individual players and teams, 
and categorizing patterns of children in 
the foster care system [Hedberg 1996]. 
Several journals have had recent special 
issues on data mining [Cohen 1996, 
Cross 1996, Wah 1996]. 

6.4.1 Data Mining Approaches. 
Data mining, like clustering, is an ex- 
ploratory activity, so clustering methods 
are well suited for data mining. Cluster- 
ing is often an important initial step of 
several in the data mining process 
[Fayyad 1996]. Some of the data mining 
approaches which use clustering are da- 
tabase segmentation, predictive model- 
ing, and visualization of large data- 
bases. 

Segmentation. Clustering methods 
are used in data mining to segment 
databases into homogeneous groups. 
This can serve purposes of data com- 
pression (working with the clusters 
rather than individual items), or to 
identify characteristics of subpopula- 
tions which can be targeted for specific 
purposes (e.g., marketing aimed at se- 
nior citizens). 

A continuous k-means clustering algo- 
rithm [Faber 1994] has been used to 
cluster pixels in Landsat images [Faber 
et al. 1994]. Each pixel originally has 7 
values from different satellite bands, 
including infra-red. These 7 values are 
difficult for humans to assimilate and 
analyze without assistance. Pixels with 
the 7 feature values are clustered into 
256 groups, then each pixel is assigned 
the value of the cluster centroid. The 
image can then be displayed with the 
spatial information intact. Human view- 
ers can look at a single picture and 
identify a region of interest (e.g., high- 
way or forest) and label it as a concept. 
The system then identifies other pixels 
in the same cluster as an instance of 
that concept. 
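
The compression step (cluster the pixel vectors, then replace each pixel by its cluster centroid) can be sketched with ordinary batch k-means. Note the substitution: the source uses the continuous k-means variant of Faber [1994], which is not reproduced here. The function names, the fixed iteration count, and the caller-supplied initial centers are assumptions; real data would have 7 values per pixel and 256 centers.

```python
def nearest(point, centers):
    """Index of the center closest to the point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centers[c])))

def kmeans(points, centers, iterations=10):
    """Plain batch k-means starting from the given initial centers."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(v) / len(g) for v in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers

def quantize(points, centers):
    """Replace every pixel vector by the centroid of its cluster."""
    return [centers[nearest(p, centers)] for p in points]
```

Because the quantized pixels keep their image positions, the result can be displayed with the spatial information intact, as described above.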

Predictive Modeling. Statistical meth- 
ods of data analysis usually involve hy- 
pothesis testing of a model the analyst 
already has in mind. Data mining can 
aid the user in discovering potential 

Figure 34. The seven smallest clusters found in the document set. These are stemmed words.



tool in which derived clusters can be 
exported as new attributes which can 
then be characterized by the system. 
For example, breakfast cereals are clus- 
tered according to calories, protein, fat, 
sodium, fiber, carbohydrate, sugar, po- 
tassium, and vitamin content per serv- 
ing. Upon seeing the resulting clusters, 
the user can export the clusters to Win- 
Viz as attributes. The system shows 
that one of the clusters is characterized 
by high potassium content, and the hu- 
man analyst recognizes the individuals 
in the cluster as belonging to the "bran" 
cereal family, leading to a generaliza- 
tion that "bran cereals are high in po- 
tassium." 

6.4.2 Mining Large Unstructured Da- 
tabases. Data mining has often been 
performed on transaction and relational 
databases which have well-defined 
fields which can be used as features, but 
there has been recent research on large 
unstructured databases such as the 
World Wide Web [Etzioni 1996].

Examples of recent attempts to clas- 
sify Web documents using words or 
functions of words as features include 
Maarek and Shaul [1996] and Chekuri 
et al. [1999]. However, relatively small 
sets of labeled training samples and 
very large dimensionality limit the ulti- 
mate success of automatic Web docu- 



ment categorization based on words as 
features. 

Rather than grouping documents in a 
word feature space, Wulfekuhler and 
Punch [1997] cluster the words from a 
small collection of World Wide Web doc- 
uments in the document space. The 
sample data set consisted of 85 docu- 
ments from the manufacturing domain 
in 4 different user-defined categories 
(labor, legal, government, and design). 
These 85 documents contained 5190 dis- 
tinct word stems after common words 
(the, and, of) were removed. Since the 
words are certainly not uncorrelated, 
they should fall into clusters where 
words used in a consistent way across 
the document set have similar values of 
frequency in each document. 

k-means clustering was used to group
the 5190 words into 10 groups. One 
surprising result was that an average of 
92% of the words fell into a single clus- 
ter, which could then be discarded for 
data mining purposes. The smallest 
clusters contained terms which to a hu- 
man seem semantically related. The 7 
smallest clusters from a typical run are 
shown in Figure 34. 
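
The word-clustering procedure described above can be sketched in a few lines. This is a minimal illustration only, assuming each word stem is represented by its vector of per-document frequencies; the function names (`kmeans`, `small_clusters`), the random initialization, and the iteration cap are our own assumptions and are not taken from Wulfekuhler and Punch [1997].

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=100, seed=0):
    """Plain k-means; returns one cluster index per input vector."""
    rng = random.Random(seed)
    # Hypothetical initialization: k input vectors chosen at random.
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: each vector goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(v, centers[c])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def small_clusters(words, vectors, k, seed=0):
    """Cluster the words, discard the largest cluster, return the rest."""
    labels = kmeans(vectors, k, seed=seed)
    largest = Counter(labels).most_common(1)[0][0]
    groups = {}
    for word, lab in zip(words, labels):
        if lab != largest:
            groups.setdefault(lab, []).append(word)
    return list(groups.values())
```

Here each row of `vectors` would hold the frequency of one word stem in each document, so words that are used consistently together receive similar rows. The outcome depends on the initial centers, which is why the text speaks of a typical run.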

Terms which are used in ordinary 
contexts, or unique terms which do not 
occur often across the training docu- 
ment set will tend to cluster into the 
large 4000-member group. This takes
care of spelling errors, proper names 
which are infrequent, and terms which 
are used in the same manner through- 
out the entire document set. Terms used 
in specific contexts (such as file in the 
context of filing a patent, rather than a 
computer file) will appear in the docu- 
ments consistently with other terms ap- 
propriate to that context (patent, invent) 
and thus will tend to cluster together. 
Among the groups of words, unique con- 
texts stand out from the crowd. 

After discarding the largest cluster, 
the smaller set of features can be used 
to construct queries for seeking out 
other relevant documents on the Web 
using standard Web searching tools 
(e.g., Lycos, Alta Vista, Open Text). 

Searching the Web with terms taken 
from the word clusters allows discovery 
of finer grained topics (e.g., family med- 
ical leave) within the broadly defined 
categories (e.g., labor). 

6.4.3 Data Mining in Geological Da- 
tabases. Database mining is a critical 
resource in oil exploration and produc- 
tion. It is common knowledge in the oil 
industry that the typical cost of drilling 
a new offshore well is in the range of
$30-40 million, but the chance of that 
site being an economic success is 1 in 
10. More informed and systematic drill- 
ing decisions can significantly reduce 
overall production costs. 

Advances in drilling technology and 
data collection methods have led to oil 
companies and their ancillaries collect- 
ing large amounts of geophysical/geolog- 
ical data from production wells and ex- 
ploration sites, and then organizing 
them into large databases. Data mining 
techniques have recently been used to
derive precise analytic relations be- 
tween observed phenomena and param- 
eters. These relations can then be used 
to quantify oil and gas reserves. 

In qualitative terms, good recoverable 
reserves have high hydrocarbon saturation
that is trapped by highly porous
sediments (reservoir porosity) and surrounded
by hard bulk rocks that prevent
the hydrocarbon from leaking
away. A large volume of porous sediments
is crucial to finding good recoverable
reserves; therefore, developing reliable
and accurate methods for
estimation of sediment porosities from 
the collected data is key to estimating 
hydrocarbon potential. 

The general rule of thumb experts use 
for porosity computation is that it is a 
quasiexponential function of depth: 

$\text{Porosity} = K \cdot e^{-F(x_1, x_2, \ldots, x_m) \cdot \text{Depth}}$. (4)

A number of factors such as rock types, 
structure, and cementation as parame- 
ters of function F confound this rela- 
tionship. This necessitates the defini- 
tion of proper contexts, in which to 
attempt discovery of porosity formulas. 
Geological contexts are expressed in 
terms of geological phenomena, such as 
geometry, lithology, compaction, and 
subsidence, associated with a region. It 
is well known that geological context
changes from basin to basin (different 
geographical areas in the world) and 
also from region to region within a ba- 
sin [Allen and Allen 1990; Biswas 1995]. 
Furthermore, the underlying features of 
contexts may vary greatly. Simple 
model matching techniques, which work 
in engineering domains where behavior 
is constrained by man-made systems 
and well-established laws of physics, 
may not apply in the hydrocarbon explo- 
ration domain. To address this, data 
clustering was used to identify the rele- 
vant contexts, and then equation discov- 
ery was carried out within each context. 
The goal was to derive the subset $x_1$,
$x_2, \ldots, x_m$ from a larger set of geological
features, and the functional relation- 
ship F that best defined the porosity 
function in a region. 
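
As a worked illustration of how such a relation can be fitted, suppose that within a single context F reduces to a constant f (an assumption made here purely for illustration). Taking logarithms of Eq. (4) then gives ln(Porosity) = ln K - f * Depth, so K and f follow from ordinary least squares on the log-transformed porosities. The sketch below shows this; the function name `fit_exponential_porosity` is hypothetical, and this is not the equation discovery procedure of the cited work.

```python
import math


def fit_exponential_porosity(depths, porosities):
    """Fit porosity = K * exp(-f * depth) by least squares on ln(porosity).

    Assumes F is a constant f within the context, and that all
    porosities are strictly positive so the logarithm is defined.
    """
    n = len(depths)
    log_p = [math.log(p) for p in porosities]
    mean_x = sum(depths) / n
    mean_y = sum(log_p) / n
    sxx = sum((x - mean_x) ** 2 for x in depths)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(depths, log_p))
    slope = sxy / sxx                 # equals -f
    intercept = mean_y - slope * mean_x  # equals ln K
    return math.exp(intercept), -slope
```

Fitting in log space keeps the problem linear, at the cost of weighting relative rather than absolute errors in porosity.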

The overall methodology, illustrated
in Figure 35, consists of two primary
steps: (i) context definition using
unsupervised clustering techniques, and (ii)
equation discovery by regression
analysis [Li and Biswas 1995]. Real
exploration data collected from a region in the
(1) Context Definition

1.1 discover primitive structures ($g_1, g_2, \ldots, g_m$) by clustering,

1.2 define context in terms of the relevant sequences of primitive structures, i.e.,
$C_i = g_{i1} \circ g_{i2} \circ \cdots \circ g_{ik}$,

1.3 group data according to the context definition to form homogeneous data groups,

1.4 for each relevant data group, determine the set of relevant variables ($x_1, x_2, \ldots, x_k$) for
porosity.

(2) Equation Derivation

2.1 select possible base models (equations) using domain theory,

2.2 use the least squares method to generate coefficient values for each base model,

2.3 use the component plus residual plot (cprp) heuristic to dynamically modify the equation
model to better fit the data.

Figure 35. Description of the knowledge-based scientific discovery process. 



Alaska basin was analyzed using the 
methodology developed. The data ob- 
jects (patterns) are described in terms of 
37 geological features, such as porosity, 
permeability, grain size, density, and 
sorting, amount of different mineral 
fragments (e.g., quartz, chert, feldspar) 
present, nature of the rock fragments, 
pore characteristics, and cementation. 
All these feature values are numeric
measurements made on samples ob- 
tained from well-logs during exploratory 
drilling processes. 

The k-means clustering algorithm
was used to identify a set of homogeneous
primitive geological structures
($g_1, g_2, \ldots, g_m$). These primitives were
then mapped onto the unit code versus 
stratigraphic unit map. Figure 36 de- 
picts a partial mapping for a set of wells 
and four primitive structures. The next 
step in the discovery process identified 
sections of well regions that were made
up of the same sequence of geological 
primitives. Every sequence defined a 
context $C_i$. From the partial mapping of
Figure 36, the context $g_2 \circ g_1 \circ g_1 \circ g_1$
was identified in two well regions
(the 300 and 600 series). After the
contexts were defined, data points be- 
longing to each context were grouped 
together for equation derivation. The 
derivation procedure employed multiple 
regression analysis [Sen and Srivastava 
1990]. 
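
The grouping step described above can be sketched as follows. This is a minimal illustration, assuming each well region has already been reduced to its ordered sequence of primitive labels; the region names, the label strings, and the functions `group_by_context` and `pool_samples` are hypothetical and are not part of the cited methodology.

```python
from collections import defaultdict


def group_by_context(region_sequences):
    """Map each distinct sequence of primitives to the regions showing it.

    region_sequences: dict of region name -> ordered list of primitive
    labels (hypothetical encoding of the partial mapping).
    """
    contexts = defaultdict(list)
    for region, sequence in region_sequences.items():
        contexts[tuple(sequence)].append(region)
    return dict(contexts)


def pool_samples(contexts, samples_by_region):
    """Pool the samples of all regions that share a context."""
    return {
        sequence: [s for region in regions for s in samples_by_region[region]]
        for sequence, regions in contexts.items()
    }
```

Each pooled group is then a homogeneous data set on which a separate regression can be run, which is the point of defining contexts before deriving equations.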

This method was applied to a data set 
of about 2600 objects corresponding to
sample measurements collected from
wells in the Alaskan Basin. The
k-means algorithm clustered this data set into
seven groups. As an illustration, we se- 
lected a set of 138 objects representing a 
context for further analysis. The fea- 
tures that best defined this cluster were 
selected, and experts surmised that the 
context represented a low porosity re- 
gion, which was modeled using the re- 
gression procedure. 

7. SUMMARY 

There are several applications where 
decision making and exploratory pat- 
tern analysis have to be performed on 
large data sets. For example, in docu- 
ment retrieval, a set of relevant docu- 
ments has to be found among several 
millions of documents of dimensionality 
of more than 1000. It is possible to 
handle these problems if some useful 
abstraction of the data is obtained and 
is used in decision making, rather than 
directly using the entire data set. By 
data abstraction, we mean a simple and 
compact representation of the data. 
This simplicity helps the machine in 
efficient processing or a human in com- 
prehending the structure in data easily. 
Clustering algorithms are ideally suited 
for achieving data abstraction. 

In this paper, we have examined var- 
ious steps in clustering: (1) pattern rep- 
resentation, (2) similarity computation, 
(3) grouping process, and (4) cluster rep- 
resentation. Also, we have discussed 
statistical, fuzzy, neural, evolutionary,
and knowledge-based approaches to 
clustering. We have described four ap- 
plications of clustering: (1) image seg- 
mentation, (2) object recognition, (3) 
document retrieval, and (4) data mining.

Clustering is a process of grouping 
data items based on a measure of simi- 
larity. Clustering is a subjective pro- 
cess; the same set of data items often 
needs to be partitioned differently for 
different applications. This subjectivity 
makes the process of clustering difficult. 
This is because a single algorithm or 
approach is not adequate to solve every 
clustering problem. A possible solution 
lies in reflecting this subjectivity in the 
form of knowledge. This knowledge is 
used either implicitly or explicitly in
one or more phases of clustering. 
Knowledge-based clustering algorithms 
use domain knowledge explicitly. 

The most challenging step in clustering
is feature extraction or pattern representation.
Pattern recognition researchers
conveniently avoid this step
by assuming that the pattern representations
are available as input to the
clustering algorithm. In small size data 
sets, pattern representations can be ob- 
tained based on previous experience of 
the user with the problem. However, in 
the case of large data sets, it is difficult 
for the user to keep track of the impor- 
tance of each feature in clustering. A 
solution is to make as many measure- 
ments on the patterns as possible and 
use them in pattern representation. But 
it is not possible to use a large collection 
of measurements directly in clustering 
because of computational costs. So several
feature extraction/selection approaches
have been designed to obtain
linear or nonlinear combinations of 
these measurements which can be used 
to represent patterns. Most of the 
schemes proposed for feature extrac- 
tion/selection are typically iterative in 
nature and cannot be used on large data 
sets due to prohibitive computational 
costs.

The second step in clustering is simi- 
larity computation. A variety of 
schemes have been used to compute 
similarity between two patterns. They 
use knowledge either implicitly or explicitly.
Most of the knowledge-based
clustering algorithms use explicit 
knowledge in similarity computation. 
However, if patterns are not repre- 
sented using proper features, then it is 
not possible to get a meaningful parti- 
tion irrespective of the quality and 
quantity of knowledge used in similar- 
ity computation. There is no universally 
acceptable scheme for computing simi- 
larity between patterns represented us- 
ing a mixture of both qualitative and 
quantitative features. Dissimilarity be- 
tween a pair of patterns is represented 
using a distance measure that may or 
may not be a metric. 
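
To make the difficulty with mixed features concrete, the sketch below shows one simple way of combining qualitative and quantitative features into a single dissimilarity, similar in spirit to Gower's coefficient. It is an illustrative choice only, not a scheme endorsed in this paper: numeric differences are scaled by an assumed known feature range, and qualitative features contribute 0 on a match and 1 otherwise. The function name and the `kinds` and `ranges` arguments are hypothetical.

```python
def mixed_dissimilarity(a, b, kinds, ranges):
    """Average per-feature dissimilarity for mixed feature types.

    kinds[i] is "num" for a quantitative feature or "cat" for a
    qualitative one; ranges[i] is the value range of numeric feature i
    (ignored for qualitative features).
    """
    total = 0.0
    for x, y, kind, value_range in zip(a, b, kinds, ranges):
        if kind == "num":
            # Range-normalized absolute difference, in [0, 1] when
            # both values lie within the stated range.
            total += abs(x - y) / value_range if value_range else 0.0
        else:
            # Simple matching for qualitative values.
            total += 0.0 if x == y else 1.0
    return total / len(kinds)
```

The result depends on how the ranges are chosen and on the equal weighting of features, which is one reason no single scheme of this kind is universally accepted.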

The next step in clustering is the 
grouping step. There are broadly two 
grouping schemes: hierarchical and par- 
titional schemes. The hierarchical 
schemes are more versatile, and the 
partitional schemes are less expensive. 
The partitional algorithms aim at minimizing
the squared error criterion function.
Motivated by the failure of the
squared error partitional clustering al- 
gorithms in finding the optimal solution 
to this problem, a large collection of 
approaches have been proposed and 
used to obtain the global optimal solu- 
tion to this problem. However, these 
schemes are computationally prohibi- 
tive on large data sets. ANN-based clus- 
tering schemes are neural implementa- 
tions of the clustering algorithms, and 
they share the undesired properties of 
these algorithms. However, ANNs have 
the capability to automatically normal- 
ize the data and extract features. An 
important observation is that even if a 
scheme can find the optimal solution to 
the squared error partitioning problem, 
it may still fall short of the require- 
ments because of the possible non-iso- 
tropic nature of the clusters. 
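
The last observation can be checked numerically. The sketch below computes the squared error of a partition (the sum of squared distances from each pattern to its cluster centroid) and applies it to a made-up data set of two elongated, parallel strips. The data and the function name `squared_error` are our own illustrative assumptions.

```python
def squared_error(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(
            sum((x - c) ** 2 for x, c in zip(point, centroid))
            for point in members
        )
    return total


# Two horizontal strips, long in x and close together in y.
points = [(x, y) for y in (0, 2) for x in (0, 2, 4, 6, 8, 10)]
by_strip = [0 if y == 0 else 1 for x, y in points]  # the elongated clusters
by_half = [0 if x < 5 else 1 for x, y in points]    # a cut across both strips

print(squared_error(points, by_strip))  # 140.0
print(squared_error(points, by_half))   # 44.0
```

The partition that cuts across both strips has the lower squared error, so an algorithm that finds the global optimum of this criterion would still split the elongated clusters.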

In some applications, for example in 
document retrieval, it may be useful to 
have a clustering that is not a partition. 
This means clusters are overlapping. 
Fuzzy clustering and functional cluster- 
ing are ideally suited for this purpose. 
Also, fuzzy clustering algorithms can
handle mixed data types. However, a
major problem with fuzzy clustering is 
that it is difficult to obtain the member- 
ship values. A general approach may 
not work because of the subjective na- 
ture of clustering. It is required to rep- 
resent clusters obtained in a suitable 
form to help the decision maker. Knowl- 
edge-based clustering schemes generate 
intuitively appealing descriptions of 
clusters. They can be used even when 
the patterns are represented using a 
combination of qualitative and quanti- 
tative features, provided that knowl- 
edge linking a concept and the mixed 
features are available. However, imple- 
mentations of the conceptual clustering 
schemes are computationally expensive 
and are not suitable for grouping large 
data sets. 
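
For reference, one common way of obtaining membership values is the update rule used in fuzzy c-means, in which the membership of a pattern in a cluster falls off with its relative distance to that cluster's center. The sketch below assumes Euclidean distance and a fuzzifier m > 1; the function name `fuzzy_memberships` and the handling of a pattern that coincides with a center are our own choices.

```python
import math


def fuzzy_memberships(point, centers, m=2.0):
    """Fuzzy c-means style memberships of one point in each cluster.

    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)), where d_i is the
    Euclidean distance from the point to center i. Requires m > 1.
    """
    dists = [math.dist(point, center) for center in centers]
    zeros = dists.count(0.0)
    if zeros:
        # The point sits exactly on one or more centers: share the
        # full membership among those centers.
        return [1.0 / zeros if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]
```

The memberships sum to one, and a point can belong to several clusters at once, which is what allows the overlapping clusters mentioned above. The values still depend on the choice of m and of the distance, so the subjectivity noted in the text remains.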

The k-means algorithm and its neural
implementation, the Kohonen net, are
most successfully used on large data
sets. This is because the k-means algorithm
is simple to implement and computationally
attractive because of its linear
time complexity. However, even this
linear time algorithm may not be
feasible on very large data sets. Incremental
algorithms like leader and its neural 
implementation, the ART network, can 
be used to cluster large data sets. But 
they tend to be order-dependent. Divide 
and conquer is a heuristic that has been 
rightly exploited by computer algorithm 
designers to reduce computational costs. 
However, it should be judiciously used 
in clustering to achieve meaningful re- 
sults. 
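
The order dependence of incremental algorithms is easy to demonstrate. The sketch below is a minimal version of the leader algorithm, assuming Euclidean distance, a user-chosen threshold, and assignment to the first leader found within the threshold; these details and the function name `leader_clustering` are assumptions made for illustration.

```python
import math


def leader_clustering(points, threshold):
    """Single-pass leader clustering.

    Each point joins the first existing leader within `threshold`;
    otherwise it becomes the leader of a new cluster.
    """
    leaders = []
    labels = []
    for point in points:
        for index, leader in enumerate(leaders):
            if math.dist(point, leader) <= threshold:
                labels.append(index)
                break
        else:
            # No leader is close enough: start a new cluster.
            leaders.append(point)
            labels.append(len(leaders) - 1)
    return leaders, labels


# The same three points presented in two different orders.
print(leader_clustering([(0.0,), (1.0,), (2.0,)], 1.0))  # two clusters
print(leader_clustering([(1.0,), (0.0,), (2.0,)], 1.0))  # one cluster
```

Only one pass over the data is needed and only the leaders are kept in memory, which is what makes the approach attractive for large data sets, but the partition obtained depends on the presentation order.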

In summary, clustering is an interest- 
ing, useful, and challenging problem. It 
has great potential in applications like 
object recognition, image segmentation, 
and information filtering and retrieval. 
However, it is possible to exploit this 
potential only after making several de- 
sign choices carefully. 

ACKNOWLEDGMENTS 

The authors wish to acknowledge the 
generosity of several colleagues who 



ACM Computing Surveys, Vol. 31, No. 3, September 1999 



Data Clustering • 317 



read manuscript drafts, made sugges- 
tions, and provided summaries of 
emerging application areas which we 
have incorporated into this paper. Gau- 
tam Biswas and Cen Li of Vanderbilt 
University provided the material on 
knowledge discovery in geological databases.
Ana Fred of Instituto Superior
Técnico in Lisbon, Portugal provided
material on cluster analysis in the syn- 
tactic domain. William Punch and Mari- 
lyn Wulfekuhler of Michigan State Uni- 
versity provided material on the 
application of cluster analysis to data 
mining problems. Scott Connell of Mich- 
igan State provided material describing 
his work on character recognition. Chi- 
tra Dorai of IBM T.J. Watson Research 
Center provided material on the use of 
clustering in 3D object recognition. 
Jianchang Mao of IBM Almaden Research
Center, Peter Bajcsy of the University
of Illinois, and Zoran Obradovic
of Washington State University also 
provided many helpful comments. Mario 
de Figueirido performed a meticulous 
reading of the manuscript and provided 
many helpful suggestions. 

This work was supported by the Na- 
tional Science Foundation under grant 
INT-9321584. 

REFERENCES 

Aarts, E. and Korst, J. 1989. Simulated Annealing
and Boltzmann Machines: A Stochastic
Approach to Combinatorial Optimization
and Neural Computing. Wiley-Interscience
series in discrete mathematics and optimization.
John Wiley and Sons, Inc., New York,
NY.

ACM, 1994. ACM CR Classifications. ACM
Computing Surveys 35, 6-16.
Al-Sultan, K. S. 1995. A tabu search approach 
to clustering problems. Pattern Recogn. 28, 
1443-1451. 

Al-Sultan, K. S. and Khan, M. M. 1996. 
Computational experience on four algorithms 
for the hard clustering problem. Pattern 
Recogn. Lett. 17, 3, 295-308. 
Allen, P. A. and Allen, J. R. 1990. Basin 
Analysis: Principles and Applica- 
tions. Blackwell Scientific Publications, Inc., 
Cambridge, MA. 
Alta Vista, 1999. http://altavista.digital.com. 
Amadasun, M. and King, R. A. 1988. Low-level
segmentation of multispectral images via agglomerative
clustering of uniform neighbourhoods.
Pattern Recogn. 21, 3 (1988),



Anderberg, M. R. 1973. Cluster Analysis for 
Applications. Academic Press, Inc., New 
York, NY. 

Augustson, J. G. and Minker, J. 1970. An
analysis of some graph theoretical clustering 
techniques. J. ACM 17, 4 (Oct. 1970), 571- 
688. 

Babu, G. P. and Murty, M. N. 1993. A near- 
optimal initial seed value selection in 
K-means algorithm using a genetic algorithm.
Pattern Recogn. Lett. 14, 10 (Oct. 1993), 763- 
769. 

Babu, G. P. and Murty, M. N. 1994. Clustering 
with evolution strategies. Pattern Recogn. 
27, 321-329. 

Babu, G. P., Murty, M. N., and Keerthi, S. 
S. 2000. Stochastic connectionist approach 
for pattern clustering (To appear). IEEE 
Trans. Syst. Man Cybern.. 

Backer, F. B. and Hubert, L. J. 1976. A graph- 
theoretic approach to goodness-of-fit in com- 
plete-link hierarchical clustering. J. Am. 
Stat. Assoc. 71, 870-878. 

Backer, E. 1995. Computer-Assisted Reasoning
in Cluster Analysis. Prentice Hall Interna- 
tional (UK) Ltd., Hertfordshire, UK. 

Baeza-Yates, R. A. 1992. Introduction to data 
structures and algorithms related to informa- 
tion retrieval. In Information Retrieval: 
Data Structures and Algorithms, W. B. 
Frakes and R. Baeza-Yates, Eds. Prentice- 
Hall, Inc., Upper Saddle River, NJ, 13-27. 

Bajcsy, P. 1997. Hierarchical segmentation 
and clustering using similarity analy- 
sis. Ph.D. Dissertation. Department of 
Computer Science, University of Illinois at 
Urbana-Champaign, Urbana, IL. 

Ball, G. H. and Hall, D. J. 1965. ISODATA, a 
novel method of data analysis and classifica- 
tion. Tech. Rep.. Stanford University, 
Stanford, CA. 

Bentley, J. L. and Friedman, J. H. 1978. Fast 
algorithms for constructing minimal spanning 
trees in coordinate spaces. IEEE Trans. 
Comput. C-27, 6 (June), 97-105. 

Bezdek, J. C. 1981. Pattern Recognition With
Fuzzy Objective Function Algorithms. Ple- 
num Press, New York, NY. 

Bhuyan, J. N., Raghavan, V. V., and Venkatesh, 
K. E. 1991. Genetic algorithm for clustering
with an ordered representation. In Proceedings
of the Fourth International Conference
on Genetic Algorithms, 408-415.

Biswas, G., Weinberg, J., and Li, C. 1995. A 
Conceptual Clustering Method for Knowledge 
Discovery in Databases. Editions Technip. 

Brailovsky, V. L. 1991. A probabilistic ap- 
proach to clustering. Pattern Recogn. Lett. 
12, 4 (Apr. 1991), 193-198. 









318 • A. Jain et al. 



Brodatz, P. 1966. Textures: A Photographic Al- 
bum for Artists and Designers. Dover Publi- 
cations, Inc., Mineola, NY. 
Can, F. 1993. Incremental clustering for dy- 
namic information processing. ACM Trans. 
Inf. Syst. 11,2 (Apr. 1993), 143-164. 
Carpenter, G. and Grossberg, S. 1990. ART3: 
Hierarchical search using chemical transmit- 
ters in self-organizing pattern recognition ar- 
chitectures. Neural Networks 3, 129-152. 
Chekuri, C., Goldwasser, M. H., Raghavan, P.,
and Upfal, E. 1997. Web search using automatic
classification. In Proceedings of the
Sixth International Conference on the World
Wide Web (Santa Clara, CA, Apr.), http://
theory.stanford.edu/people/wass/publications/
Web Search/Web Search.html.
Cheng, C. H. 1995. A branch-and-bound clus- 
tering algorithm. IEEE Trans. Syst. Man 
Cybern. 25, 895-898. 
Cheng, Y. and Fu, K. S. 1986. Conceptual clus- 
tering in knowledge organization. IEEE 
Trans. Pattern Anal. Mach. Intell. 7, 592-598. 
Cheng, Y. 1995. Mean shift, mode seeking, and
clustering. IEEE Trans. Pattern Anal. Mach. 
Intell. 17, 7 (July), 790-799. 
Chien, Y. T. 1978. Interactive Pattern Recognition.
Marcel Dekker, Inc., New York, NY.
Choudhury, S. and Murty, M. N. 1990. A divi- 
sive scheme for constructing minimal span- 
ning trees in coordinate space. Pattern 
Recogn. Lett. 11, 6 (Jun. 1990), 385-389. 
1996. Special issue on data mining. Commun.
ACM 39, 11.
Coleman, G. B. and Andrews, H. 
C. 1979. Image segmentation by cluster- 
ing. Proc. IEEE 67, 5, 773-785. 
Connell, S. and Jain, A. K. 1998. Learning
prototypes for on-line handwritten digits. In 
Proceedings of the 14th International Confer- 
ence on Pattern Recognition (Brisbane, Aus- 
tralia, Aug.), 182-184. 
Cross, S. E., Ed. 1996. Special issue on data
mining. IEEE Expert 11, 6 (Oct.).
Dale, M. B. 1986. On the comparison of conceptual
clustering and numerical taxonomy.
IEEE Trans. Pattern Anal. Mach. Intell.
7, 241-244.
Dave, R. N. 1992. Generalized fuzzy C-shells
clustering and detection of circular and ellip- 
tic boundaries. Pattern Recogn. 25, 713-722. 
Davis, T., Ed. 1991. The Handbook of Genetic 
Algorithms. Van Nostrand Reinhold Co., 
New York, NY. 
Day, W. H. E. 1992. Complexity theory: An in- 
troduction for practitioners of classifica- 
tion. In Clustering and Classification, P. 
Arabie and L. Hubert, Eds. World Scientific 
Publishing Co., Inc., River Edge, NJ. 
Dempster, A. P., Laird, N. M., and Rubin, D. 
B. 1977. Maximum likelihood from incom- 
plete data via the EM algorithm. J. Royal 
Stat. Soc. B. 39, 1, 1-38. 



Diday, E. 1973. The dynamic cluster method in
non-hierarchical clustering. J. Comput. Inf.
Sci. 2, 61-88.
Diday, E. and Simon, J. C. 1976. Clustering
analysis. In Digital Pattern Recognition, K.
S. Fu, Ed. Springer-Verlag, Secaucus, NJ,
47-94.

Diday, E. 1988. The symbolic approach in clus- 
tering. In Classification and Related Meth- 
ods, H. H. Bock, Ed. North-Holland Pub- 
lishing Co., Amsterdam, The Netherlands. 

Dorai, C. and Jain, A. K. 1995. Shape spectra
based view grouping for free-form objects. In 
Proceedings of the International Conference on 
Image Processing (ICIP-95), 240-243. 

Dubes, R. C. and Jain, A. K. 1976. Clustering 
techniques: The user's dilemma. Pattern 
Recogn. 8, 247-260. 

Dubes, R. C. and Jain, A. K. 1980. Clustering
methodology in exploratory data analysis. In
Advances in Computers, M. C. Yovits, Ed.
Academic Press, Inc., New York, NY, 113- 
125. 

Dubes, R. C. 1987. How many clusters are 
best?— an experiment. Pattern Recogn. 20, 6 
(Nov. 1, 1987), 645-663. 

Dubes, R. C. 1993. Cluster analysis and related 
issues. In Handbook of Pattern Recognition 
& Computer Vision, C. H. Chen, L. F. Pau, 
and P. S. P. Wang, Eds. World Scientific 
Publishing Co., Inc., River Edge, NJ, 3-32. 

Dubuisson, M. P. and Jain, A. K. 1994. A modified
Hausdorff distance for object matching.
In Proceedings of the International Conference
on Pattern Recognition (ICPR
'94), 566-568.

Duda, R. O. and Hart, P. E. 1973. Pattern 
Classification and Scene Analysis. John 
Wiley and Sons, Inc., New York, NY. 

Dunn, S., Janos, L., and Rosenfeld, A. 1983. 
Bimean clustering. Pattern Recogn. Lett. 1, 
169-173. 

Duran, B. S. and Odell, P. L. 1974. Cluster
Analysis: A Survey. Springer-Verlag, New 
York, NY. 

Eddy, W. F., Mockus, A., and Oue, S. 1996. 
Approximate single linkage cluster analysis of 
large data sets in high-dimensional spaces. 
Comput. Stat. Data Anal. 23, 1, 29-43. 

Etzioni, O. 1996. The World-Wide Web: quagmire
or gold mine? Commun. ACM 39, 11,
65-68. 

Everitt, B. S. 1993. Cluster Analysis. Edward
Arnold, Ltd., London, UK.
Faber, V. 1994. Clustering and the continuous
k-means algorithm. Los Alamos Science 22,
138-144.

Faber, V., Hochberg, J. C., Kelly, P. M., Thomas,
T. R., and White, J. M. 1994. Concept extraction:
A data-mining technique. Los
Alamos Science 22, 122-149. 
Fayyad, U. M. 1996. Data mining and knowledge
discovery: Making sense out of data.
IEEE Expert 11, 5 (Oct.), 20-25. 






Fisher, D. and Langley, P. 1986. Conceptual 
clustering and its relation to numerical tax- 
onomy. In Artificial Intelligence and Statis- 
tics, A W. Gale, Ed. Addison-Wesley Long- 
man Publ. Co., Inc., Reading, MA, 77-116. 

Fisher, D. 1987. Knowledge acquisition via incremental
conceptual clustering. Mach.
Learn. 2, 139-172. 

Fisher, D., Xu, L., Carnes, R., Rich, Y., Fenves, S. 
J., Chen, J., Shiavi, R., Biswas, G., and Wein- 
berg, J. 1993. Applying AI clustering to 
engineering tasks. IEEE Expert 8, 51-60.

Fisher, L. and Van Ness, J. W. 1971. 
Admissible clustering procedures. Biometrika 
58, 91-104. 

Flynn, P. J. and Jain, A. K. 1991. BONSAI: 3D 
object recognition using constrained search. 
IEEE Trans. Pattern Anal. Mach. Intell. 13, 
10 (Oct. 1991), 1066-1075. 

Fogel, D. B. and Simpson, P. K. 1993. Evolving 
fuzzy clusters. In Proceedings of the Interna- 
tional Conference on Neural Networks (San 
Francisco, CA), 1829-1834. 

Fogel, D. B. and Fogel, L. J., Eds. 1994. Spe- 
cial issue on evolutionary computation. 
IEEE Trans. Neural Netw. (Jan.). 

Fogel, L. J., Owens, A. J., and Walsh, M. 
J. 1965. Artificial Intelligence Through 
Simulated Evolution. John Wiley and Sons, 
Inc., New York, NY. 

Frakes, W. B. and Baeza-Yates, R., Eds. 
1992. Information Retrieval: Data Struc- 
tures and Algorithms. Prentice-Hall, Inc., 
Upper Saddle River, NJ. 

Fred, A. L. N. and Leitao, J. M. N. 1996. A 
minimum code length technique for clustering 
of syntactic patterns. In Proceedings of the 
International Conference on Pattern Recogni- 
tion (Vienna, Austria), 680-684. 

Fred, A. L. N. 1996. Clustering of sequences 
using a minimum grammar complexity crite- 
rion. In Grammatical Inference: Learning 
Syntax from Sentences, L. Miclet and C. 
Higuera, Eds. Springer-Verlag, Secaucus, 
NJ, 107-116. 

Fu, K. S. and Lu, S. Y. 1977. A clustering pro- 
cedure for syntactic patterns. IEEE Trans. 
Syst. Man Cybern. 7, 734-742. 

Fu, K. S. and Mui, J. K. 1981. A survey on 
image segmentation. Pattern Recogn. 13, 
3-16. 

Fukunaga, K. 1990. Introduction to Statistical 
Pattern Recognition. 2nd ed. Academic 
Press Prof., Inc., San Diego, CA. 

Glover, F. 1986. Future paths for integer programming
and links to artificial intelligence.
Comput. Oper. Res. 13, 5 (May 1986), 533- 
549. 

Goldberg, D. E. 1989. Genetic Algorithms in 
Search, Optimization and Machine Learning. 
Addison-Wesley Publishing Co., Inc., Red- 
wood City, CA. 



Gordon, A. D. and Henderson, J. T. 1977.
Algorithm for Euclidean sum of squares.
Biometrics 33, 355-362.
Gotlieb, G. C. and Kumar, S. 1968. Semantic
clustering of index terms. J. ACM 15, 493-
513.

Gowda, K. C. 1984. A feature reduction and 
unsupervised classification algorithm for mul- 
tispectral data. Pattern Recogn. 17, 6, 667- 



Gowda, K. C. and Krishna, G. 1977. 
Agglomerative clustering using the concept of 
mutual nearest neighborhood. Pattern 
Recogn. 10, 105-112. 

Gowda, K. C. and Diday, E. 1992. Symbolic 
clustering using a new dissimilarity mea- 
sure. IEEE Trans. Syst. Man Cybern. 22, 
368-378. 

Gower, J. C. and Ross, G. J. S. 1969. Minimum 
spanning trees and single-linkage cluster
analysis. Appl. Stat. 18, 54-64. 

Grefenstette, J. 1986. Optimization of control
parameters for genetic algorithms. IEEE 
Trans. Syst. Man Cybern. SMC-16, 1 (Jan./ 
Feb. 1986), 122-128. 

Haralick, R. M. and Kelly, G. L. 1969. 
Pattern recognition with measurement space 
and spatial clustering for multiple images. 
Proc. IEEE 57, 4, 654-665. 

Hartigan, J. A. 1975. Clustering Algorithms.
John Wiley and Sons, Inc., New York, NY. 

Hedberg, S. 1996. Searching for the mother 
lode: Tales of the first data miners. IEEE 
Expert 11, 5 (Oct.), 4-7. 

Hertz, J., Krogh, A., and Palmer, R. G. 1991. 
Introduction to the Theory of Neural Computation.
Santa Fe Institute Studies in the Sciences
of Complexity lecture notes. Addison-Wesley
Longman Publ. Co., Inc., Reading,
MA. 

Hoffman, R. and Jain, A. K. 1987. 
Segmentation and classification of range im- 
ages. IEEE Trans. Pattern Anal. Mach. In- 
tell. PAMI-9, 5 (Sept. 1987), 608-620. 

Hofmann, T. and Buhmann, J. 1997. Pairwise 
data clustering by deterministic annealing. 
IEEE Trans. Pattern Anal. Mach. Intell. 19, 1 
(Jan.), 1-14. 

Hofmann, T., Puzicha, J., and Buhmann, J.
M. 1998. Unsupervised texture segmenta- 
tion in a deterministic annealing framework. 
IEEE Trans. Pattern Anal. Mach. Intell. 20, 8, 
803-818. 

Holland, J. H. 1975. Adaptation in Natural and
Artificial Systems. University of Michigan 
Press, Ann Arbor, MI. 

Hoover, A., Jean-Baptiste, G., Jiang, X., Flynn, 
P. J., Bunke, H., Goldgof, D. B., Bowyer, K., 
Eggert, D. W., Fitzgibbon, A., and Fisher, R. 









Huttenlocher, D. P., Klanderman, G. A., and
Rucklidge, W. J. 1993. Comparing images
using the Hausdorff distance. IEEE Trans.
Pattern Anal. Mach. Intell. 15, 9, 850-863.
Ichino, M. and Yaguchi, H. 1994. Generalized 
Minkowski metrics for mixed feature-type 
data analysis. IEEE Trans. Syst. Man Cy- 
bern. 24, 698-708. 

1991. Proceedings of the International Joint Con- 
ference on Neural Networks. (IJCNN'91). 

1992. Proceedings of the International Joint Con- 
ference on Neural Networks. 

Ismail, M. A. and Kamel, M. S. 1989. 
Multidimensional data clustering utilizing 
hybrid search strategies. Pattern Recogn. 22, 
1 (Jan. 1989), 75-89. 

Jain, A. K. and Dubes, R. C. 1988. Algorithms 
for Clustering Data. Prentice-Hall advanced 
reference series. Prentice-Hall, Inc., Upper 
Saddle River, NJ. 

Jain, A. K. and Farrokhnia, F. 1991. 
Unsupervised texture segmentation using Ga- 
bor filters. Pattern Recogn. 24, 12 (Dec. 
1991), 1167-1186. 

Jain, A. K. and Bhattacharjee, S. 1992. Text 
segmentation using Gabor filters for auto- 
matic document processing. Mach. Vision 
Appl. 5, 3 (Summer 1992), 169-184.

Jain, A. K. and Flynn, P. J., Eds. 1993. Three
Dimensional Object Recognition Systems. 
Elsevier Science Inc., New York, NY. 

Jain, A. K. and Mao, J. 1994. Neural networks 
and pattern recognition. In Computational 
Intelligence: Imitating Life, J. M. Zurada, R. 
J. Marks, and C. J. Robinson, Eds. 194- 
212. 

Jain, A. K. and Flynn, P. J. 1996. Image seg- 
mentation using clustering. In Advances in 
Image Understanding: A Festschrift for Azriel 
Rosenfeld, N. Ahuja and K. Bowyer, Eds.
IEEE Press, Piscataway, NJ, 65-83. 

Jain, A. K. and Mao, J. 1996. Artificial neural
networks: A tutorial. IEEE Computer 29 
(Mar.), 31-44. 

Jain, A. K., Ratha, N. K., and Lakshmanan, S. 
1997. Object detection using Gabor filters. 
Pattern Recogn. 30, 2, 295-309. 

Jain, N. C., Indrayan, A., and Goel, L. R. 
1986. Monte Carlo comparison of six hierar- 
chical clustering methods on random data. 
Pattern Recogn. 19, 1 (Jan./Feb. 1986), 95-99. 

Jain, R., Kasturi, R., and Schunck, B. G. 
1995. Machine Vision. McGraw-Hill series 
in computer science. McGraw-Hill, Inc., New 
York, NY. 

Jarvis, R. A. and Patrick, E. A. 1973. 
Clustering using a similarity method based on 
shared near neighbors. IEEE Trans. Comput.
C-22, 8 (Aug.), 1025-1034.

Jolion, J.-M., Meer, P., and Bataouche, S.
1991. Robust clustering with applications in 
computer vision. IEEE Trans. Pattern Anal. 
Mach. Intell. 13, 8 (Aug. 1991), 791-802. 



Jones, D. and Beltramo, M. A. 1991. Solving 
partitioning problems with genetic algorithms. 
In Proceedings of the Fourth International 
Conference on Genetic Algorithms, 442-449. 

Judd, D., McKinley, P., and Jain, A. K.
1996. Large-scale parallel data clustering. 
In Proceedings of the International Conference 
on Pattern Recognition (Vienna, Aus- 
tria), 488-493. 

King, B. 1967. Step-wise clustering proce- 
dures. J. Am. Stat. Assoc. 69, 86-101. 

Kirkpatrick, S., Gelatt, C. D., Jr., and Vecchi,
M. P. 1983. Optimization by simulated
annealing. Science 220, 4598 (May), 671-680.

Klein, R. W. and Dubes, R. C. 1989.
Experiments in projection and clustering by
simulated annealing. Pattern Recogn. 22,
213-220.

Knuth, D. 1973. The Art of Computer Programming.
Addison-Wesley, Reading, MA.

Koontz, W. L. G., Fukunaga, K., and Narendra,
P. M. 1975. A branch and bound clustering 
algorithm. IEEE Trans. Comput. 23, 908- 
914. 

Kohonen, T. 1989. Self-Organization and Asso- 
ciative Memory. 3rd ed. Springer informa- 
tion sciences series. Springer-Verlag, New 
York, NY. 

Kraaijveld, M., Mao, J., and Jain, A. K. 1995. 
A non-linear projection method based on Ko- 
honen's topology preserving maps. IEEE 
Trans. Neural Netw. 6, 548-559. 

Krishnapuram, R., Frigui, H., and Nasraoui, O.
1995. Fuzzy and probabilistic shell cluster- 
ing algorithms and their application to bound- 
ary detection and surface approximation. 
IEEE Trans. Fuzzy Systems 3, 29-60. 

Kurita, T. 1991. An efficient agglomerative 
clustering algorithm using a heap. Pattern 
Recogn. 24, 3 (1991), 205-209. 

Library of Congress, 1990. LC classification 
outline. Library of Congress, Washington, 
DC. 

Lebowitz, M. 1987. Experiments with incre- 
mental concept formation. Mach. Learn. 2, 
103-138. 

Lee, H.-Y. and Ong, H.-L. 1996. Visualization 
support for data mining. IEEE Expert 11, 5 
(Oct.), 69-75. 

Lee, R. C. T., Slagle, J. R., and Mong, C. T.
1978. Towards automatic auditing of 
records. IEEE Trans. Softw. Eng. 4, 441- 
448. 

Lee, R. C. T. 1981. Cluster analysis and its
applications. In Advances in Information 
Systems Science, J. T. Tou, Ed. Plenum 
Press, New York, NY. 

Li, C. and Biswas, G. 1995. Knowledge-based 
scientific discovery in geological databases. 
In Proceedings of the First International Con- 
ference on Knowledge Discovery and Data 
Mining (Montreal, Canada, Aug. 20-21), 
204-209. 



ACM Computing Surveys, Vol. 31, No. 3, September 1999 






Lu, S. Y. and Fu, K. S. 1978. A sentence-to- 
sentence clustering procedure for pattern 
analysis. IEEE Trans. Syst. Man Cybern. 8, 
381-389. 

Lundervold, A., Fenstad, A. M., Ersland, L., and 
Taxt, T. 1996. Brain tissue volumes from 
multispectral 3D MRI: A comparative study of 
four classifiers. In Proceedings of the Confer- 
ence of the Society on Magnetic Resonance, 

Maarek, Y. S. and Ben Shaul, I. Z. 1996. 
Automatically organizing bookmarks per contents.
In Proceedings of the Fifth International
Conference on the World Wide Web
(Paris, May), http://www5conf.inria.fr/fich- 
html/paper-sessions.html. 

McQueen, J. 1967. Some methods for classifi- 
cation and analysis of multivariate observa- 
tions. In Proceedings of the Fifth Berkeley 
Symposium on Mathematical Statistics and 
Probability, 281-297. 

Mao, J. and Jain, A. K. 1992. Texture classification
and segmentation using multiresolution
simultaneous autoregressive models.
Pattern Recogn. 25, 2 (Feb. 1992), 173-188.

Mao, J. and Jain, A. K. 1996. Artificial neural 
networks for feature extraction and multivari- 
ate data projection. IEEE Trans. Neural 
Netw. 6, 296-317. 

Mao, J. and Jain, A. K. 1996. A self-organizing
network for hyperellipsoidal clustering (HEC).
IEEE Trans. Neural Netw. 7, 16-29.

Mevins, A. J. 1995. A branch and bound incre- 
mental conceptual clusterer. Mach. Learn. 
18, 5-22. 

Michalski, R., Stepp, R. E., and Diday, E.
1981. A recent advance in data analysis: 
Clustering objects into classes characterized 
by conjunctive concepts. In Progress in Pat- 
tern Recognition, Vol. 1, L. Kanal and A. 
Rosenfeld, Eds. North-Holland Publishing 
Co., Amsterdam, The Netherlands. 

Michalski, R., Stepp, R. E., and Diday, 
E. 1983. Automated construction of classi- 
fications: conceptual clustering versus numer- 
ical taxonomy. IEEE Trans. Pattern Anal. 
Mach. Intell. PAMI-5, 5 (Sept.), 396-409. 

Mishra, S. K. and Raghavan, V. V. 1994. An 
empirical study of the performance of heuris- 
tic methods for clustering. In Pattern Recog- 
nition in Practice, E. S. Gelsema and L. N. 
Kanal, Eds. 425-436. 

Mitchell, T. 1997. Machine Learning. McGraw- 
Hill, Inc., New York, NY. 

Mohiuddin, K. M. and Mao, J. 1994. A compar- 
ative study of different classifiers for hand- 
printed character recognition. In Pattern 
Recognition in Practice, E. S. Gelsema and L. 
N. Kanal, Eds. 437-448. 

Moor, B. K. 1988. ART 1 and Pattern Cluster- 
ing. In 1988 Connectionist Summer School, 
Morgan Kaufmann, San Mateo, CA, 174-185. 

Murtagh, F. 1984. A survey of recent advances 
in hierarchical clustering algorithms which 
use cluster centers. Comput. J. 26, 354-359. 



Murty, M. N. and Krishna, G. 1980. A compu- 
tationally efficient technique for data cluster- 
ing. Pattern Recogn. 12, 153-158. 

Murty, M. N. and Jain, A. K. 1995. Knowledge- 
based clustering scheme for collection man- 
agement and retrieval of library books. Pat- 
tern Recogn. 28, 949-984. 

Nagy, G. 1968. State of the art in pattern
recognition. Proc. IEEE 56, 836-862.

Ng, R. and Han, J. 1994. Very large data bases.
In Proceedings of the 20th International Con- 
ference on Very Large Data Bases (VLDB'94, 
Santiago, Chile, Sept.), VLDB Endowment, 
Berkeley, CA, 144-155. 

Nguyen, H. H. and Cohen, P. 1993. Gibbs random
fields, fuzzy clustering, and the unsupervised
segmentation of textured images.
CVGIP: Graph. Models Image Process. 56, 1
(Jan. 1993), 1-19.

Oehler, K. L. and Gray, R. M. 1995. 
Combining image compression and classifica- 
tion using vector quantization. IEEE Trans. 
Pattern Anal. Mach. Intell. 17, 461-473. 

Oja, E. 1982. A simplified neuron model as a 
principal component analyzer. Bull. Math. 
Bio. 15, 267-273. 

Ozawa, K. 1985. A stratificational overlapping 
cluster scheme. Pattern Recogn. 18, 279-286. 

Open Text, 1999. http://index.opentext.net. 

Kamgar-Parsi, B., Gualtieri, J. A., Devaney, J. A.,
and Kamgar-Parsi, K. 1990. Clustering with
neural networks. Biol. Cybern. 63, 201-208. 

LYCOS, 1999. http://www.lycos.com. 

Pal, N. R., Bezdek, J. C., and Tsao, E. C.-K. 
1993. Generalized clustering networks and 
Kohonen's self-organizing scheme. IEEE 
Trans. Neural Netw. 4, 549-657. 

Quinlan, J. R. 1990. Decision trees and decision
making. IEEE Trans. Syst. Man Cybern.
20, 339-346.

Raghavan, V. V. and Birchand, K. 1979. A 
clustering strategy based on a formalism of 
the reproductive process in a natural system. 
In Proceedings of the Second International 
Conference on Information Storage and Re- 
trieval, 10-22. 

Raghavan, V. V. and Yu, C. T. 1981. A comparison
of the stability characteristics of some
graph theoretic clustering methods. IEEE 
Trans. Pattern Anal. Mach. Intell. 3, 393-402. 

Rasmussen, E. 1992. Clustering algorithms.
In Information Retrieval: Data Structures and 
Algorithms, W. B. Frakes and R. Baeza-Yates, 
Eds. Prentice-Hall, Inc., Upper Saddle 
River, NJ, 419-442. 

Rich, E. 1983. Artificial Intelligence.
McGraw-Hill, Inc., New York, NY.

Ripley, B. D., Ed. 1989. Statistical Inference 
for Spatial Processes. Cambridge University 
Press, New York, NY. 

Rose, K., Gurewitz, E., and Fox, G. C. 1993.
Deterministic annealing approach to constrained
clustering. IEEE Trans. Pattern
Anal. Mach. Intell. 15, 785-794.




Rosenfeld, A. and Kak, A. C. 1982. Digital Picture
Processing. 2nd ed. Academic Press,
Inc., New York, NY.

Rosenfeld, A., Schneider, V. B., and Huang, M. K.
1969. An application of cluster detection
to text and picture processing. IEEE Trans.
Inf. Theor. 15, 6, 672-681.

Ross, G. J. S. 1968. Classification techniques
for large sets of data. In Numerical Taxon- 
omy, A. J. Cole, Ed. Academic Press, Inc., 
New York, NY. 

Ruspini, E. H. 1969. A new approach to cluster- 
ing. Inf. Control 15, 22-32. 

Salton, G. 1991. Developments in automatic 
text retrieval. Science 253, 974-980. 

Samal, A. and Iyengar, P. A. 1992. Automatic 
recognition and analysis of human faces and 
facial expressions: A survey. Pattern Recogn. 
25, 1 (Jan. 1992), 65-77. 

Sammon, J. W. Jr. 1969. A nonlinear mapping 
for data structure analysis. IEEE Trans. 
Comput. 18, 401-409. 

Sangal, R. 1991. Programming Paradigms in
LISP. McGraw-Hill, Inc., New York, NY. 

Schachter, B. J., Davis, L. S., and Rosenfeld,
A. 1979. Some experiments in image seg- 
mentation by clustering of local feature val- 
ues. Pattern Recogn. 11, 19-28. 

Schwefel, H. P. 1981. Numerical Optimization 
of Computer Models. John Wiley and Sons, 
Inc., New York, NY. 

Selim, S. Z. and Ismail, M. A. 1984. K-means-type
algorithms: A generalized convergence
theorem and characterization of local opti- 
mality. IEEE Trans. Pattern Anal. Mach. In- 
tell. 6, 81-87. 

Selim, S. Z. and Alsultan, K. 1991. A simu- 
lated annealing algorithm for the clustering 
problem. Pattern Recogn. 24, 10 (1991), 
1003-1008. 

Sen, A. and Srivastava, M. 1990. Regression
Analysis. Springer-Verlag, New York, NY.

Sethi, I. and Jain, A. K., Eds. 1991. Artificial
Neural Networks and Pattern Recognition:
Old and New Connections. Elsevier Science
Inc., New York, NY.

Shekar, B., Murty, N. M., and Krishna, G.
1987. A knowledge-based clustering scheme.
Pattern Recogn. Lett. 6, 4 (Apr. 1, 1987),
253-259.

Silverman, J. F. and Cooper, D. B. 1988. 
Bayesian clustering for unsupervised estima- 
tion of surface and texture models. 
IEEE Trans. Pattern Anal. Mach. Intell. 10, 4 
(July 1988), 482-495. 

Simoudis, E. 1996. Reality check for data min- 
ing. IEEE Expert 11, 5 (Oct.), 26-33. 

Slagle, J. R., Chang, C. L., and Heller, S. R.
1975. A clustering and data-reorganizing al- 
gorithm. IEEE Trans. Syst. Man Cybern. 6, 
125-128. 

Sneath, P. H. A. and Sokal, R. R. 1973. 
Numerical Taxonomy. Freeman, London, 
UK. 



Spath, H. 1980. Cluster Analysis Algorithms 
for Data Reduction and Classification. Ellis 
Horwood, Upper Saddle River, NJ. 

Solberg, A., Taxt, T., and Jain, A. 1996. A 
Markov random field model for classification 
of multisource satellite imagery. IEEE 
Trans. Geoscience and Remote Sensing 34, 1, 
100-113. 

Srivastava, A. and Murty, M. N. 1990. A comparison
between conceptual clustering and
conventional clustering. Pattern Recogn. 23, 
9 (1990), 975-981. 

Stahl, H. 1986. Cluster analysis of large data 
sets. In Classification as a Tool of Research, 
W. Gaul and M. Schader, Eds. Elsevier 
North-Holland, Inc., New York, NY, 423-430. 

Stepp, R. E. and Michalski, R. S. 1986. 
Conceptual clustering of structured objects: A 
goal-oriented approach. Artif. Intell. 28, 1 
(Feb. 1986), 43-69. 

Sutton, M., Stark, L., and Bowyer, K.
1993. Function-based generic recognition for 
multiple object categories. In Three-Dimen- 
sional Object Recognition Systems, A. Jain 
and P. J. Flynn, Eds. Elsevier Science Inc., 
New York, NY. 

Symon, M. J. 1977. Clustering criterion and 
multi-variate normal mixture. Biometrics 
77, 35-43. 

Tanaka, E. 1995. Theoretical aspects of syntac- 
tic pattern recognition. Pattern Recogn. 28, 
1053-1061. 

Taxt, T. and Lundervold, A. 1994. Multi- 
spectral analysis of the brain using magnetic 
resonance imaging. IEEE Trans. Medical 
Imaging 13, 3, 470-481. 

Titterington, D. M., Smith, A. F. M., and Makov, 
U. E. 1985. Statistical Analysis of Finite 
Mixture Distributions. John Wiley and Sons, 
Inc., New York, NY. 

Toussaint, G. T. 1980. The relative neighbor- 
hood graph of a finite planar set. Pattern 
Recogn. 12, 261-268. 

Trier, O. D. and Jain, A. K. 1995. Goal- 
directed evaluation of binarization methods. 
IEEE Trans. Pattern Anal. Mach. Intell. 17, 
1191-1201. 

Uchiyama, T. and Arbib, M. A. 1994. Color image
segmentation using competitive learning. 
IEEE Trans. Pattern Anal. Mach. Intell. 16, 12 
(Dec. 1994), 1197-1206. 

Urquhart, R. B. 1982. Graph theoretical clus- 
tering based on limited neighborhood 
sets. Pattern Recogn. 15, 173-187. 

Venkateswarlu, N. B. and Raju, P. S. V. S. K.
1992. Fast ISODATA clustering algorithms. 
Pattern Recogn. 25, 3 (Mar. 1992), 335-342. 

Vinod, V. V., Chaudhury, S., Mukherjee, J., and
Ghose, S. 1994. A connectionist approach 
for clustering with applications in image 
analysis. IEEE Trans. Syst. Man Cybern. 24, 
365-384. 






Wah, B. W., Ed. 1996. Special section on mining
of databases. IEEE Trans. Knowl. Data
Eng. (Dec). 

Ward, J. H. Jr. 1963. Hierarchical grouping to 
optimize an objective function. J. Am. Stat. 
Assoc. 58, 236-244. 

Watanabe, S. 1985. Pattern Recognition: Human
and Mechanical. John Wiley and Sons,
Inc., New York, NY. 

Weszka, J. 1978. A survey of threshold selec- 
tion techniques. Pattern Recogn. 7, 259-265. 

Whitley, D., Starkweather, T., and Fuquay, 
D. 1989. Scheduling problems and travel- 
ing salesman: the genetic edge recombina- 
tion. In Proceedings of the Third Interna- 
tional Conference on Genetic Algorithms 
(George Mason University, June 4-7), J. D. 
Schaffer, Ed. Morgan Kaufmann Publishers 
Inc., San Francisco, CA, 133-140. 

Wilson, D. R. and Martinez, T. R. 1997. 
Improved heterogeneous distance func- 
tions. J. Artif. Intell. Res. 6, 1-34. 

Wu, Z. and Leahy, R. 1993. An optimal graph 
theoretic approach to data clustering: Theory 



Wulfekuhler, M. and Punch, W. 1997. Finding
salient features for personal web page categories.
In Proceedings of the Sixth International
Conference on the World Wide Web (Santa Clara,
CA, Apr.), http://theory.stanford.edu/people/
wass/publications/Web Search/Web Search.html.

Zadeh, L. A. 1965. Fuzzy sets. Inf. Control 8,



Zahn, C. T. 1971. Graph-theoretical methods 
for detecting and describing gestalt clusters. 
IEEE Trans. Comput. C-20 (Apr.), 68-86. 

Zhang, K. 1995. Algorithms for the constrained 
editing distance between ordered labeled 
trees and related problems. Pattern Recogn. 
28, 463-474. 

Zhang, J. and Michalski, R. S. 1995. An inte- 
gration of rule induction and exemplar-based 
learning for graded concepts. Mach. Learn. 
21, 3 (Dec. 1995), 235-267. 

Zhang, T., Ramakrishnan, R., and Livny, M. 
1996. BIRCH: An efficient data clustering 
method for very large databases. SIGMOD 
Rec. 26, 2, 103-114.

Zupan, J. 1982. Clustering of Large Data 
Sets. Research Studies Press Ltd., Taunton, 
UK. 



Received: March 1997; revised: October 1998; accepted: January 1999






EXHIBIT C 



[The scanned pages of Exhibit C are not legible in this copy. The only recoverable labels are the axis titles "PETAL WIDTH (cm)" and "SIMILARITY SCALE" and the column heading "(Class)".]



EXHIBIT D 



Measures of Central Tendency 



In samples, as well as in populations, one generally finds a preponderance of values 
somewhere around the middle of the range of observed values. The description of this 
concentration near the middle is an average, or a measure of central tendency to the 
statistician. It is also termed a measure of location, for it indicates where, along the 
measurement scale, the sample or population is located. 

Various measures of central tendency are useful parameters, in that they describe 
a property of populations. This chapter discusses the characteristics of these parameters 
and the sample statistics that are good estimates of them. 



3.1 The Arithmetic Mean 

The most widely used measure of central tendency is the arithmetic mean,* usually referred
to simply as the mean,† which is the measure most commonly called an "average."

Each measurement in a population may be referred to as an $X_i$ (read "X sub i")
value. Thus, one measurement might be denoted as $X_1$, another as $X_2$, another as $X_3$,
and so on. The subscript i might be any integer value up through N, the total number of



*As an adjective, "arithmetic" is pronounced with the accent on the third syllable. In early literature on 
the subject, the adjective "arithmetical" was employed. 

†The term "mean" (the arithmetic mean, as well as the geometric and harmonic means of Section 3.5)
dates from ancient Greece (Walker, 1929: 183). 
X values in the population.* The mean of the population is denoted by the Greek letter
$\mu$ (lowercase mu), and is calculated as the sum of all the $X_i$ values divided by the size
of the population.

The calculation of the population mean can be abbreviated concisely by the formula 

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}. \qquad (3.1)$$

The Greek letter $\Sigma$ (capital sigma) means "summation"† and $\sum_{i=1}^{N} X_i$ means "summation
of all $X_i$ values from $X_1$ through $X_N$." Thus, for example, $\sum_{i=1}^{4} X_i = X_1 + X_2 + X_3 + X_4$
and $\sum_{i=3}^{5} X_i = X_3 + X_4 + X_5$. Since, in statistical computations, summations are nearly
always performed over the entire set of $X_i$ values, this book will assume $\sum X_i$ to
mean "sum $X_i$'s over all values of i" simply as a matter of printing convenience, and
$\mu = \sum X_i / N$ would therefore designate the same calculation as would $\mu = \sum_{i=1}^{N} X_i / N$.
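As a minimal sketch of this notation, the Python function below evaluates a summation with 1-based, inclusive limits. The function name `summation` and the five data values are illustrative assumptions, not taken from the text.

```python
def summation(x, lower=1, upper=None):
    """Return x_lower + ... + x_upper using 1-based, inclusive limits.

    With the defaults the sum runs over every value, matching the
    shorthand in which the limits are omitted.
    """
    if upper is None:
        upper = len(x)
    if not 1 <= lower <= upper <= len(x):
        raise ValueError("limits must satisfy 1 <= lower <= upper <= len(x)")
    # Python slices are 0-based and exclude the end index.
    return sum(x[lower - 1:upper])


# Hypothetical values X_1 .. X_5, chosen only for illustration.
X = [2, 4, 6, 8, 10]
partial_1_to_4 = summation(X, 1, 4)      # X_1 + X_2 + X_3 + X_4
partial_3_to_5 = summation(X, 3, 5)      # X_3 + X_4 + X_5
population_mean = summation(X) / len(X)  # Equation 3.1 with N = 5
```

Omitting both limits gives the sum over all values, which is the convention the text adopts for the rest of the book.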

The most efficient, unbiased, and consistent estimate of the population mean, $\mu$, is
the sample mean, denoted as $\bar{X}$ (read as "X bar"). Whereas the size of the population
(which we generally do not know) is denoted as N, the size of a sample is indicated by
n, and $\bar{X}$ is calculated as

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \quad \text{or} \quad \bar{X} = \frac{\sum X_i}{n}, \qquad (3.2)$$

which is read "the sample mean equals the sum of all measurements in the sample
divided by the number of measurements in the sample."‡ Example 3.1 demonstrates the
calculation of the sample mean. Note that the mean has the same units of measurement 
as do the individual observations. The question of how many decimal places should be 
reported for the mean will be answered at the end of Section 6.3; until then we shall 
simply record the mean with one more decimal place than the data. 
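The calculation in Equation 3.2 can be sketched in a few lines of Python, here applied to the 24 wing lengths listed in Example 3.1. The names `wing_lengths_cm` and `sample_mean` are illustrative choices, not from the text.

```python
# Wing lengths (cm) of the 24 butterflies in Example 3.1.
wing_lengths_cm = [3.3, 3.5, 3.6, 3.6, 3.7, 3.8, 3.8, 3.8, 3.9, 3.9, 3.9, 4.0,
                   4.0, 4.0, 4.0, 4.1, 4.1, 4.1, 4.2, 4.2, 4.3, 4.3, 4.4, 4.5]


def sample_mean(values):
    """Sample mean (Equation 3.2): the sum of the measurements divided by n."""
    if not values:
        raise ValueError("the sample mean of an empty sample is undefined")
    return sum(values) / len(values)


n = len(wing_lengths_cm)
total_cm = sum(wing_lengths_cm)
mean_cm = sample_mean(wing_lengths_cm)

# Report the mean with one more decimal place than the data.
print(f"n = {n}, sum = {total_cm:.1f} cm, mean = {mean_cm:.2f} cm")
# prints: n = 24, sum = 95.0 cm, mean = 3.96 cm
```

The result carries the same unit (cm) as the individual observations, as the text notes.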

If, as in Example 3.1, a sample contains multiple identical data for several values 
of the variable, then it may be convenient to record the data in the form of a frequency 



*Charles Babbage (1792-1871) was an English mathematician and inventor, who conceived principles 
used by modern computers — well before the advent of electronics — and who, in 1832, proposed the modern 
convention of italicizing letters to denote quantities (Cajori, 1929: 2, 6). 

†Swiss mathematician Leonhard Euler, in 1755, was the first to use Σ to denote summation (Cajori,
1928: 2, 61).

‡The modern symbols for plus and minus arose in Germany during the 1480s, with Johann Widman
the first to use them in print (Cajori, 1928: 222-223). The modern equal sign was invented by English
mathematician Robert Recorde, who published it in 1557 along with the first appearance of "+" and "−" in an
English work; and it became standard about the start of the eighteenth century (Cajori, 1928: 164, 232-233).
Many other symbols were used for mathematical operations, before and after these introductions (e.g., ibid.:
229-245). Using a horizontal line to express division derives from its use, in denoting fractions, by Arabic
author Al-Hassâr in the twelfth century, though it was not consistently employed for several more centuries
(ibid.: 269, 310). The solidus ("/") was recommended for division by the English writer Augustus De Morgan
in 1845 (ibid.: 312-313), and the Swiss author Johann Heinrich Rahn proposed, in 1659, the symbol "÷",
which was previously often used by authors as a minus sign (ibid.: 211, 270).






EXAMPLE 3.1 A sample of 24 from a population of butterfly wing lengths. 

$X_i$ (in centimeters): 3.3, 3.5, 3.6, 3.6, 3.7, 3.8, 3.8, 3.8, 3.9, 3.9, 3.9, 4.0, 4.0, 4.0, 4.0, 4.1, 4.1, 4.1,
4.2, 4.2, 4.3, 4.3, 4.4, 4.5.

$\sum X_i = 95.0$ cm
$n = 24$



table, as in Example 3.2. Then $X_i$ can be said to denote each of k different measurements
and $f_i$ can denote the frequency with which that $X_i$ occurs in the sample. The sample
mean may then be calculated, using the sums of the products of $f_i$ and $X_i$, as*

$$\bar{X} = \frac{\sum f_i X_i}{n}. \qquad (3.3)$$

Example 3.2 demonstrates this calculation for the same data as in Example 3.1. 
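A short Python sketch of Equation 3.3 follows. It builds the frequency table from the Example 3.1 data with `collections.Counter` and checks that the frequency-weighted mean agrees with the mean of the raw data. The variable names are illustrative assumptions.

```python
from collections import Counter

# Wing lengths (cm) from Example 3.1.
wing_lengths_cm = [3.3, 3.5, 3.6, 3.6, 3.7, 3.8, 3.8, 3.8, 3.9, 3.9, 3.9, 4.0,
                   4.0, 4.0, 4.0, 4.1, 4.1, 4.1, 4.2, 4.2, 4.3, 4.3, 4.4, 4.5]

# Frequency table: each distinct measurement X_i mapped to its frequency f_i.
frequencies = Counter(wing_lengths_cm)

n = sum(frequencies.values())                                # sum of f_i
weighted_sum = sum(f * x for x, f in frequencies.items())    # sum of f_i * X_i
mean_from_table = weighted_sum / n                           # Equation 3.3
mean_from_raw = sum(wing_lengths_cm) / len(wing_lengths_cm)  # Equation 3.2

for x in sorted(frequencies):
    print(f"{x:.1f}  {frequencies[x]}  {frequencies[x] * x:.1f}")
```

A value that never occurs, such as 3.4 cm, has frequency 0 and contributes nothing to either sum, so it can be listed in the table or left out without changing the mean.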



EXAMPLE 3.2 The data from Example 3.1 recorded as a frequency table. 



X_i (cm)    f_i    f_i X_i (cm)
3.3         1       3.3
3.4         0       0
3.5         1       3.5
3.6         2       7.2
3.7         1       3.7
3.8         3      11.4
3.9         3      11.7
4.0         4      16.0
4.1         3      12.3
4.2         2       8.4
4.3         2       8.6
4.4         1       4.4
4.5         1       4.5

$\sum f_i = n = 24$        $\sum f_i X_i = 95.0$ cm

$\bar{X} = \sum f_i X_i / n = 95.0$ cm$/24 = 3.96$ cm

median = 3.95 cm + (1/4)(0.1 cm)
       = 3.95 cm + 0.025 cm
       = 3.975 cm



*Denoting the multiplication of two quantities (e.g., a and b) by their adjacent placement (i.e., ab)
derives from very old practices as far back as Hindu manuscripts from the seventh century (Cajori, 1928: 77,
250). Modern usage also includes Gottfried Wilhelm Leibniz's 1698 recommendation of a dot: a · b (ibid.:
267) and William Oughtred's 1631 suggestion of St. Andrew's cross: a × b (ibid.: 251). The 1659 use of an
asterisk-like symbol "*" (ibid.: 212-213) did not persist but resurfaced in electronic computer languages of
the latter half of the twentieth century.
Figure 3.1 A histogram of the data in
Example 3.2. The mean (3.96 cm) is the
center of gravity of the histogram, and the
median (3.975 cm) divides the histogram
into two equal areas.

[Histogram not reproduced. Horizontal axis: Wing Length ($X_i$) in cm, from 3.3 to 4.5.]



If data are plotted as a histogram (Fig. 3.1), the mean is the center of gravity of
the histogram.* That is, if the histogram were made of a solid material, it would balance
horizontally with the fulcrum at $\bar{X}$. The mean is applicable to ratio or interval scale data.
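The balance property can be checked numerically. The sketch below relies on a standard identity that the passage does not state explicitly: the signed deviations of the observations from their mean sum to zero, so the net moment about a fulcrum placed at the mean vanishes. The function name `net_moment` is an illustrative assumption.

```python
# Wing lengths (cm) from Example 3.1.
wing_lengths_cm = [3.3, 3.5, 3.6, 3.6, 3.7, 3.8, 3.8, 3.8, 3.9, 3.9, 3.9, 4.0,
                   4.0, 4.0, 4.0, 4.1, 4.1, 4.1, 4.2, 4.2, 4.3, 4.3, 4.4, 4.5]

mean_cm = sum(wing_lengths_cm) / len(wing_lengths_cm)


def net_moment(values, fulcrum):
    """Sum of the signed distances of the observations from the fulcrum."""
    return sum(x - fulcrum for x in values)


# Zero (up to floating-point rounding) when the fulcrum sits at the mean.
moment_at_mean = net_moment(wing_lengths_cm, mean_cm)

# Nonzero for any other fulcrum; at 4.0 cm the histogram would tip to one side.
moment_at_4_0 = net_moment(wing_lengths_cm, 4.0)
```

Placing the fulcrum at 4.0 cm leaves a net moment of about -1.0 cm (95.0 cm minus 24 times 4.0 cm), so the balance point must lie slightly to the left of 4.0 cm.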



3.2 The Median 

The median is typically defined as the middle measurement in an ordered set of data.†
That is, there are just as many observations larger than the median as there are smaller.
The sample median is the best estimate of the population median. In a symmetrical
distribution (Figs. 3.2a and 3.2b) the sample median is also an unbiased and consistent
estimate of $\mu$, but it is not as efficient a statistic as $\bar{X}$, and should not be used as a
substitute for $\bar{X}$. (If the frequency distribution is asymmetrical, the median is a poor
estimate of the mean.)

The median (M) of a sample of data may be found by first arranging the measurements
in order of magnitude. The order may be either ascending or descending, but
ascending order is most commonly used as is done with the samples in Examples 3.1,
3.2, and 3.3. Then, we define the sample median as

$$M = X_{(n+1)/2}. \qquad (3.4)$$

If the sample size (n) is odd, then the subscript in Equation 3.4 will be an integer and will
indicate which datum is the middle measurement in the ordered sample. For the data of
species A in Example 3.3, n = 9 and the sample median is $M = X_{(n+1)/2} = X_{(9+1)/2} =
X_5 = 40$ mo. If n is even, then the subscript in Equation 3.4 will be a half-integer, a
number midway between two integers. This indicates that there is not a middle value in
the ordered list of data; instead, there are two middle values, and the median is defined
as the midpoint between them. For the species B data in Example 3.3, n = 10 and



*The concept of the mean as the center of gravity was used by L. A. J. Quetelet in 1846 (Walker,
1929: 73).

†The concept of the median was conceived as early as 1816, by K. F. Gauss; enunciated and reinforced
by others, including F. Galton in 1869 and 1874; and independently discovered and promoted by G. T. Fechner
beginning in 1874 (Walker, 1929: 83-88, 184).
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Figure 3.2 Frequency distributions showing measures of central tendency. Values of the 
variable are along the abscissa (horizontal axis), and the frequencies are along the ordinate 
(vertical axis). Distributions (a) and (b) are symmetrical, (c) is positively skewed, and 
(d) is negatively skewed. Distributions (a), (c), and (d) are unimodal, and distribution 
(b) is bimodal. In a unimodal asymmetric distribution, the median lies about one-third 
the distance between the mean and the mode * 



EXAMPLE 3.3 Life expectancy for two hypothetical species of birds in captivity. 


Species A 


Species B 


X t (mo) 


Xi (mo) 


34 


34 


36 


36 


37 


37 


39 


39 


40 


40 


41 


41 


42 


42 


43 


43 


79 


44 




45 


n = 9 


n =10 


M = X(„+l)/2 = X(9+l)/2 


M = X(„+i)/2 = X(l0+l)/2 


= x 5 = 40 mo 


= X 5 .5 =40.5 mo 


X = 43.4 mo 


X =40.1 mo 



*An interesting relationship between the mean, median, and standard deviation is shown in Equation 4. 14. 
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X(n+i)/2 — -^(io+i)/2 = ^5.5 . which signifies that the median is midway between X 5 and 
X 6 , namely M = (40 mo + 41 mo)/2 = 40.5 mo. 
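
The rule in Equation 3.4 can be sketched in a few lines of Python. This is only an illustration using the Example 3.3 data; the helper names `order_statistic` and `sample_median` are hypothetical and do not come from the text. A half-integer subscript is read, as above, as the midpoint of the two neighbouring ordered values.

```python
def order_statistic(sorted_data, k):
    """Return X_k for a 1-based subscript k that is an integer or half-integer.

    A half-integer subscript such as 5.5 is taken as the midpoint of X_5 and X_6.
    """
    twice_k = round(2 * k)
    if abs(2 * k - twice_k) > 1e-9:
        raise ValueError("subscript must be an integer or half-integer")
    if k < 1 or k > len(sorted_data):
        raise ValueError("subscript outside the range 1..n")
    lower = sorted_data[twice_k // 2 - 1]        # X at floor(k)
    upper = sorted_data[(twice_k + 1) // 2 - 1]  # X at ceil(k)
    return (lower + upper) / 2


def sample_median(data):
    """Sample median M = X_{(n+1)/2} (Equation 3.4)."""
    ordered = sorted(data)
    return order_statistic(ordered, (len(ordered) + 1) / 2)


species_a = [34, 36, 37, 39, 40, 41, 42, 43, 79]       # n = 9
species_b = [34, 36, 37, 39, 40, 41, 42, 43, 44, 45]   # n = 10

print(sample_median(species_a))  # 40.0
print(sample_median(species_b))  # 40.5
```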

Note that the median has the same units as each individual measurement. If data 
are plotted as a frequency histogram (e.g., Fig. 3.1), the median is the value of X that 
divides the area of the histogram into two equal parts. The sample median is, in general, 
a more efficient estimate of the population median for larger sample sizes. 

If we find the middle value(s) in an ordered set of data to be among identical 
observations (referred to as tied values), as in Example 3.1 or 3.2, a difficulty arises. 
If we apply Equation 3.4 to these twenty-four data, then we conclude the median to be
$X_{12.5}$ = 4.0 cm. But four data are tied at 4.0 cm, and eleven measurements are less than
4.0 cm and nine are greater. Thus, 4.0 cm does not fit the definition above of the median 
as that value for which there is the same number of data larger and smaller. Therefore, 
a better definition of the median of a set of data is that value for which no more than 
half the data are smaller and no more than half are larger. 

When the sample median falls among tied observations, we may interpolate to better 
estimate the population median. Using the data of Example 3.2, we desire to estimate a 
value below which 50% of the observations in the population lie. Fifty percent of the 
observations in the sample would be twelve observations. As the first 7 classes in the 
frequency table include 11 observations and 4 observations are in class 4.0 cm, we know 
that the desired sample median lies within the range of 3.95 to 4.05 cm. Assuming that 
the four observations in class 4.0 cm are distributed evenly within the 0.1 cm range of 
3.95 to 4.05 cm, then the median will be (1/4)(0.1 cm) = 0.025 cm into this class. Thus,
the median = 3.95 cm + 0.025 cm = 3.975 cm. In general, for the sample median
within a class interval containing tied observations,

$$M = \left(\text{lower limit of interval}\right) + \left(\frac{0.5n - \text{cum. freq.}}{\text{no. of observations in interval}}\right)\left(\text{interval size}\right), \tag{3.5}$$

where "cum. freq." refers to the cumulative frequency of the previous classes.* By
using this procedure, the calculated median will be the value of X that divides the area
of the histogram of the sample into two equal parts. As another example, refer back
to Example 1.5, where, by Equation 3.5: median = M = 8.75 mg/g + {[(0.5)(130) -
61]/24}(0.10 mg/g) = 8.75 mg/g + 0.02 mg/g = 8.77 mg/g.
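
As a quick check of Equation 3.5, the two calculations above can be reproduced with a small function. The function name and parameter names are illustrative only; the numerical inputs are the ones quoted in the text for Examples 3.2 and 1.5.

```python
def interpolated_median(lower_limit, n, cum_freq, freq_in_interval, interval_size):
    """Median interpolated within a class interval of tied data (Equation 3.5).

    lower_limit      -- lower limit of the class interval containing the median
    n                -- total number of observations
    cum_freq         -- cumulative frequency of the classes below that interval
    freq_in_interval -- number of observations in the interval
    interval_size    -- width of the interval
    """
    return lower_limit + (0.5 * n - cum_freq) / freq_in_interval * interval_size


# Example 3.2: 24 wing lengths, 11 below the 4.0-cm class, 4 within it.
wing_median = interpolated_median(3.95, 24, 11, 4, 0.1)
# Example 1.5: n = 130, cumulative frequency 61, 24 data in the interval.
other_median = interpolated_median(8.75, 130, 61, 24, 0.10)

print(round(wing_median, 3))   # 3.975
print(round(other_median, 2))  # 8.77
```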

The median expresses less information than does the mean, for it does not take 
into account the actual value of each measurement, but only considers the rank of each 
measurement. Still, it may offer certain advantages in some situations. First, it is 
plain from the two samples in Example 3.3 that extremely high (or extremely low) 
measurements will not affect the median as much as they affect the mean (causing the 
sample median to be called a "resistant" statistic). Thus, when we deal with skewed 
populations, we may prefer the median to the mean to express central tendency. 
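
The resistance of the median can be seen numerically with the Example 3.3 data. The sketch below uses Python's standard `statistics` module; the third list, in which the 79-mo datum is replaced by 44 mo, is a hypothetical variation introduced here only for comparison.

```python
from statistics import mean, median

species_a = [34, 36, 37, 39, 40, 41, 42, 43, 79]
species_b = [34, 36, 37, 39, 40, 41, 42, 43, 44, 45]

# Hypothetical variation: species A with the extreme 79-mo value replaced by 44 mo.
species_a_no_extreme = [34, 36, 37, 39, 40, 41, 42, 43, 44]

print(round(mean(species_a), 1), median(species_a))                        # 43.4 40
print(round(mean(species_b), 1), median(species_b))                        # 40.1 40.5
print(round(mean(species_a_no_extreme), 1), median(species_a_no_extreme))  # 39.6 40
```

The single extreme value shifts the mean of species A by almost 4 mo but leaves its median unchanged.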

Note that in Example 3.3 the researcher would have to wait 79 mo to compute a
mean life expectancy for species A (45 mo for species B), whereas the median could be
determined in only 40 mo (41 mo for species B). Also, to calculate a median one does
not need accurate data for all members of the sample. If we did not have the first
three data for species A accurately recorded, but could state them as "less than 39 mo,"
then the median could have been determined just as readily, although calculation of the
mean would not have been possible. Lastly, the median can be determined not only for
interval and ratio scale data, but also for data on the ordinal scale, data for which the
use of the mean usually would not be considered appropriate.

*This procedure was enunciated in 1878 by the German scholar, Gustav Theodor Fechner (1801-1887)
(Walker, 1929: 86).



3.3 Other Quantiles 

Just as the median is the value above and below which lies half the set of data, one can
define measures that lie above or below other fractional portions of the data. For example,
if the data are divided into four equal parts, we speak of quartiles.
One-fourth of all the ranked observations are smaller than the first quartile, one-fourth
lie between the first and second quartiles, one-fourth lie between the second and
third quartiles, and one-fourth are larger than the third quartile. The second quartile is
identical to the median. As with the median, the first and third quartiles might be one
of the data or the midpoint between two of the data. The first quartile, $Q_1$, is

$$Q_1 = X_{(n+1)/4}; \tag{3.6}$$

if the subscript, (n + 1)/4, is not an integer or half-integer, then it is rounded up to the
nearest integer or half-integer. The second quartile is the median M, and the subscript
on X for the third quartile, $Q_3$, is

$$n + 1 - \text{subscript on } X \text{ for } Q_1. \tag{3.7}$$

Examining the data in Example 3.3: For species A, n = 9, (n + 1)/4 = 2.5, and
$Q_1 = X_{2.5}$ = 36.5 mo; and $Q_3 = X_{10-2.5} = X_{7.5}$ = 42.5 mo. For species B, n =
10, (n + 1)/4 = 2.75 (which we round up to 3), and $Q_1 = X_3$ = 37 mo, and
$Q_3 = X_{11-3} = X_8$ = 43 mo.

Similarly, values that partition the ordered data set into eight equal parts (or as
equal as n will allow) are called octiles. The first octile, $O_1$, is

$$O_1 = X_{(n+1)/8}; \tag{3.8}$$

and if the subscript, (n + 1)/8, is not an integer or half-integer, then it is rounded up to
the nearest integer or half-integer. The subscript on X for the third octile, $O_3$, is

$$2(\text{subscript on } X \text{ for } Q_1) - \text{subscript on } X \text{ for } O_1; \tag{3.9}$$

the subscript on X for the fifth octile, $O_5$, is

$$n + 1 - \text{subscript on } X \text{ for } O_3; \tag{3.10}$$

and the subscript on X for the seventh octile, $O_7$, is

$$n + 1 - \text{subscript on } X \text{ for } O_1. \tag{3.11}$$

Thus, for the data of Example 3.3: For species A, n = 9, (n + 1)/8 = 1.25 (which we
round up to 1.5) and $O_1 = X_{1.5}$ = 35 mo; 2(2.5) - 1.5 = 3.5, so $O_3 = X_{3.5}$ = 38 mo;
n + 1 - 3.5 = 6.5, so $O_5 = X_{6.5}$ = 41.5 mo; and n + 1 - 1.5 = 8.5, so $O_7 = X_{8.5}$ = 61 mo.
For species B, n = 10, (n + 1)/8 = 1.375 (which we round up to 1.5) and $O_1 = X_{1.5}$ = 35 mo;
2(3) - 1.5 = 4.5, so $O_3 = X_{4.5}$ = 39.5 mo; n + 1 - 4.5 = 6.5, so $O_5 = X_{6.5}$ = 41.5 mo;
and n + 1 - 1.5 = 9.5, so $O_7 = X_{9.5}$ = 44.5 mo.
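
The subscript rules of Equations 3.6 through 3.11 lend themselves to a short sketch. The code below is one possible reading of those rules, checked against the Example 3.3 results; the function names are hypothetical, and other textbooks and software packages define quantiles by different conventions, so results from library routines may differ.

```python
import math


def round_up_to_half(k):
    """Round a subscript up to the nearest integer or half-integer."""
    return math.ceil(2 * k - 1e-9) / 2


def order_statistic(sorted_data, k):
    """X_k for a 1-based integer or half-integer subscript k.

    A half-integer subscript is the midpoint of the two neighbouring values.
    """
    if k < 1 or k > len(sorted_data):
        raise ValueError("subscript outside the range 1..n")
    twice_k = round(2 * k)
    lower = sorted_data[twice_k // 2 - 1]
    upper = sorted_data[(twice_k + 1) // 2 - 1]
    return (lower + upper) / 2


def quartiles(data):
    """Return (Q1, Q3) following Equations 3.6 and 3.7."""
    x = sorted(data)
    n = len(x)
    k1 = round_up_to_half((n + 1) / 4)   # subscript for Q1, Eq. 3.6
    k3 = n + 1 - k1                      # subscript for Q3, Eq. 3.7
    return order_statistic(x, k1), order_statistic(x, k3)


def octiles(data):
    """Return (O1, O3, O5, O7) following Equations 3.8 to 3.11."""
    x = sorted(data)
    n = len(x)
    q1 = round_up_to_half((n + 1) / 4)
    o1 = round_up_to_half((n + 1) / 8)   # Eq. 3.8
    o3 = 2 * q1 - o1                     # Eq. 3.9
    o5 = n + 1 - o3                      # Eq. 3.10
    o7 = n + 1 - o1                      # Eq. 3.11
    return tuple(order_statistic(x, k) for k in (o1, o3, o5, o7))


species_a = [34, 36, 37, 39, 40, 41, 42, 43, 79]
species_b = [34, 36, 37, 39, 40, 41, 42, 43, 44, 45]

print(quartiles(species_a))  # (36.5, 42.5)
print(quartiles(species_b))  # (37.0, 43.0)
print(octiles(species_a))    # (35.0, 38.0, 41.5, 61.0)
print(octiles(species_b))    # (35.0, 39.5, 41.5, 44.5)
```

For very small samples the octile subscripts can fall below 1; the sketch raises an error in that case rather than guessing a value.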

Besides the median, quartiles, and octiles, ordered data may be divided into fifths, 
tenths, or hundredths by quantities that are respectively called quintiles, deciles, and 
centiles (the latter also called percentiles). Measures that divide a group of ordered 
data into equal parts are collectively termed quantiles.* The expression "LD50," used in
some areas of biological research, is simply the 50th percentile of the lethal doses, or
the median lethal dose. That is, 50% of the experimental subjects survived this dose,
whereas 50% did not. Likewise, "LC50" is the median lethal concentration, or the 50th
percentile of the lethal concentrations.



3.4 The Mode 

The mode is commonly defined as the most frequently occurring measurement in a set
of data.† In Example 3.2, the mode is 4.0 cm. But it is perhaps better to define a mode
as a measurement of relatively great concentration, for some frequency distributions may
have more than one such point of concentration, even though these concentrations might
not contain precisely the same frequencies. Thus, a sample consisting of the data: 6, 7,
7, 8, 8, 8, 8, 8, 8, 9, 9, 10, 11, 12, 12, 12, 12, 12, 13, 13, and 14 mm would be said
to have two modes: at 8 mm and 12 mm. (Some authors would refer to 8 mm as the
"major mode" and call 12 mm the "minor mode.") A distribution in which each different
measurement occurs with equal frequency is said to have no mode. If two consecutive
values of X have frequencies great enough to declare the X values modes, the mode
of the distribution is said to be the midpoint of these two X's; e.g., the mode of 3, 5,
7, 7, 7, 8, 8, 8, and 10 liters is 7.5 liters. A distribution with two modes is said to 
be bimodal (e.g., Fig. 3.2b) and may indicate a combination of two distributions with 
different modes (e.g., heights of men and women). Modes are readily discerned from 
histograms or frequency polygons. 
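
One way to make "a measurement of relatively great concentration" concrete is to look for local peaks in the frequency table. The sketch below is an assumption of this illustration, not a rule stated in the text: it treats a value (or a run of adjacent observed values with equal frequency) as a mode when its frequency exceeds that of its neighbours, and reports the midpoint of such a run. With real data, sampling noise can create spurious small peaks, so the result should be judged against a histogram.

```python
from collections import Counter


def modes(data):
    """Return local peaks of the frequency table as candidate modes.

    Adjacent observed values with equal peak frequency are reported as their
    midpoint. If every value has the same frequency, no mode is reported.
    """
    counts = Counter(data)
    values = sorted(counts)
    freqs = [counts[v] for v in values]
    found = []
    i = 0
    while i < len(values):
        # Extend j over a run (plateau) of equal frequencies starting at i.
        j = i
        while j + 1 < len(values) and freqs[j + 1] == freqs[i]:
            j += 1
        left_lower = i == 0 or freqs[i - 1] < freqs[i]
        right_lower = j == len(values) - 1 or freqs[j + 1] < freqs[i]
        covers_everything = i == 0 and j == len(values) - 1
        if left_lower and right_lower and not covers_everything:
            found.append((values[i] + values[j]) / 2)
        i = j + 1
    return found


lengths_mm = [6, 7, 7, 8, 8, 8, 8, 8, 8, 9, 9, 10, 11,
              12, 12, 12, 12, 12, 13, 13, 14]
volumes_l = [3, 5, 7, 7, 7, 8, 8, 8, 10]

print(modes(lengths_mm))  # [8.0, 12.0]
print(modes(volumes_l))   # [7.5]
print(modes([1, 2, 3]))   # []
```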

The sample mode is the best estimate of the population mode. When we sample a 
symmetrical unimodal population, the mode is an unbiased and consistent estimate of the 
mean and median (Fig. 3.2a), but it is relatively inefficient and should not be so used. 



*Sir Francis Galton developed the concept of percentiles, quartiles, deciles, and other quantiles in
writings from 1869 to 1885 (Walker, 1929: 86-87, 177, 179). The term "quantile" was introduced in 1940 by
M. G. Kendall (David, 1995).

†The term "mode" was introduced by Karl Pearson in 1894 (Walker, 1929: 184).

As a measure of central tendency, the mode is affected by skewness less than is the
mean or the median, but it is more affected by sampling and grouping than these other 
two measures. The mode, but neither the median nor the mean, may be used for data 
on the nominal, as well as the ordinal, interval, and ratio scales of measurement. In a 
unimodal asymmetric distribution (Figs. 3.2c and 3.2d), the median lies about one-third 
the distance between the mean and the mode. 

The mode is not often used in biological research, although it is often interesting
to report the number of modes detected in a population, if there is more than one.



3.5 Other Measures of Central Tendency 

The range midpoint, or midrange, is also a measure of central tendency, being half-way 
between the highest and lowest values in the set of data. It is not to be considered 
a good estimate of the mean and is a seldom-used measure, for it utilizes relatively 
little information from the data (although the so-called "mean" daily temperature is often 
reported as the mean of the minimum and maximum, and is thus a range midpoint). 
The mean of any two symmetrically located percentiles, such as the mean of the first 
and third quartiles (i.e., the 25th and 75th percentiles), may be used in the same fashion 
as the range midpoint as a measure of central tendency (see Dixon and Massey, 1969: 
133-134), and it is not as adversely affected by aberrantly extreme values. But such a 
procedure is seldom encountered. As such measures are based on quantiles, they may 
be applied to either ratio, interval, or ordinal data. 
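
A brief numerical illustration, using species A of Example 3.3, shows how strongly the range midpoint reacts to one extreme value compared with the mean of the quartiles. The quartile values 36.5 mo and 42.5 mo are those obtained for species A from Equations 3.6 and 3.7; the function name `midrange` is chosen here for illustration.

```python
def midrange(data):
    """Range midpoint: halfway between the lowest and highest values."""
    return (min(data) + max(data)) / 2


species_a = [34, 36, 37, 39, 40, 41, 42, 43, 79]

# First and third quartiles of species A from Section 3.3.
q1, q3 = 36.5, 42.5

print(midrange(species_a))  # 56.5
print((q1 + q3) / 2)        # 39.5
```

The midrange (56.5 mo) is pulled far above every datum except the largest, whereas the mean of the quartiles (39.5 mo) stays near the median of 40 mo.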

The geometric mean is the nth root* of the product of the n data:

$$\bar{X}_G = \sqrt[n]{X_1 X_2 X_3 \cdots X_n} = \sqrt[n]{\prod X_i}. \tag{3.12}$$

Capital Greek pi, $\Pi$, means "take the product" in a fashion analogous to the way $\Sigma$ indicates
"take the sum." The geometric mean may also be calculated as the antilogarithm of the
arithmetic mean of the logarithms of the data (where the logarithms may be in any base);
this is much more feasible computationally:

$$\bar{X}_G = \text{antilog}\left(\frac{\log X_1 + \log X_2 + \cdots + \log X_n}{n}\right) = \text{antilog}\,\frac{\sum \log X_i}{n}. \tag{3.13}$$

The geometric mean is appropriate only when all the data are positive. If the data are all
equal, then the geometric and arithmetic means are identical; otherwise,† $\bar{X}_G < \bar{X}$. This
measure finds use in averaging ratios where it is desired to give each ratio equal weight,
and in averaging percent changes, discussions of which are found in Croxton, Cowden,
and Klein (1967: 178-182).

*Denoting the nth root as $\sqrt[n]{\phantom{x}}$ was suggested by Albert Girard as early as 1629, but this symbol was not
generally used until well into the eighteenth century (Cajori, 1928: 371-372).

†The symbols "<" (meaning "less than") and ">" (meaning "greater than") were invented in 1631 by
the English writer Thomas Harriot (Cajori, 1928: 199).
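
Equations 3.12 and 3.13 can be compared directly in code. The following is a minimal sketch; the two ratios are hypothetical values chosen so that the answer is easy to verify by hand, and natural logarithms are used since any base gives the same result.

```python
import math


def geometric_mean_by_product(data):
    """nth root of the product of the n data (Equation 3.12)."""
    return math.prod(data) ** (1 / len(data))


def geometric_mean_by_logs(data):
    """Antilog of the arithmetic mean of the logarithms (Equation 3.13)."""
    if any(x <= 0 for x in data):
        raise ValueError("geometric mean requires all data to be positive")
    return math.exp(sum(math.log(x) for x in data) / len(data))


ratios = [2.0, 8.0]  # hypothetical data: sqrt(2 * 8) = 4

print(math.isclose(geometric_mean_by_product(ratios), 4.0))  # True
print(math.isclose(geometric_mean_by_logs(ratios), 4.0))     # True
print(sum(ratios) / len(ratios))                             # 5.0, the arithmetic mean
```

The logarithmic form avoids forming a very large (or very small) product when n is large, which is the computational advantage referred to above.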

The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of
the data:

$$\bar{X}_H = \frac{1}{\dfrac{1}{n}\sum \dfrac{1}{X_i}} = \frac{n}{\sum \dfrac{1}{X_i}}. \tag{3.14}$$

It is occasionally used when averaging rates, as described by Croxton,
Cowden, and Klein (1967: 182-188). If all the data are identical, then the harmonic
mean is the same as the arithmetic mean (and also the same as the geometric mean). If
they are positive but not identical, then $\bar{X}_H < \bar{X}_G < \bar{X}$.
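
The ordering of the three means can be checked with a small example. The two rates below are hypothetical, and the function names are illustrative; the harmonic mean follows Equation 3.14 and the geometric mean follows Equation 3.13.

```python
import math


def arithmetic_mean(data):
    return sum(data) / len(data)


def geometric_mean(data):
    """Antilog of the mean of the logarithms (Equation 3.13)."""
    return math.exp(sum(math.log(x) for x in data) / len(data))


def harmonic_mean(data):
    """n divided by the sum of the reciprocals (Equation 3.14)."""
    return len(data) / sum(1 / x for x in data)


rates = [40.0, 60.0]  # hypothetical positive, non-identical data

h = harmonic_mean(rates)
g = geometric_mean(rates)
a = arithmetic_mean(rates)
print(round(h, 2), round(g, 2), round(a, 2))  # 48.0 48.99 50.0
print(h < g < a)                              # True

identical = [5.0, 5.0, 5.0]
print(math.isclose(harmonic_mean(identical), arithmetic_mean(identical)))  # True
```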

The geometric and harmonic means are appropriate only for ratio scale data. They 
are rarely encountered and the term "mean" typically implies "arithmetic mean." 



3.6 The Effect of Coding Data 

Often in the manipulation of data, considerable time and effort can be saved if coding is 
employed. Coding is the conversion of the original measurements into easier-to-work- 
with values by simple arithmetic operations. Generally coding employs a linear transfor- 
mation of the data, such as multiplying (or dividing) or adding (or subtracting) a constant. 
The addition or subtraction of a constant is sometimes termed a translation of the data 
(i.e., changing the origin), whereas the multiplication or division by a constant causes an 
expansion or contraction of the scale of measurement. The first set of data in Example 3.4 
are coded by subtracting a constant value of 840 g. Not only is each coded value equal 
to $X_i - 840$ g, but the mean of the coded values is equal to $\bar{X} - 840$ g. Thus, the
easier-to-work-with coded values may be used to calculate a mean that then is readily 
converted to the mean of the original data, simply by adding back the coding constant. 
In Sample 2 of Example 3.4, the observed data are coded by dividing each observation 
by 1000 (i.e., by multiplying by 0.001).* The resultant mean only needs to be multiplied 
by the coding factor of 1000 (i.e., divided by 0.001) to arrive at the mean of the original 
data. As the other measures of central tendency have the same units as the mean, they 
are affected by coding in exactly the same fashion. For calculations more involved than 
computing means, the advantages of coding will become more apparent. In general, 
linear transformations of ratio or interval scale data will not affect the hypothesis tests 
to be described later. 

In general, if we code X by addition of a constant, A, the coded X is 

$$[X_i] = X_i + A. \tag{3.15}$$

*In 1593, the German astronomer Christoph Clavius (1537-1612) became the first to use a decimal
point to separate units from tenths; in 1617, the Scottish mathematician John Napier (1550-1617) used both
points and commas for this purpose (Cajori, 1928: 322-323), and the comma is still used in some parts of
the world. In some countries a raised dot has been used, a symbol Americans occasionally employ to denote
multiplication.

EXAMPLE 3.4 Coding data to facilitate calculations.

    Sample 1 (Coding by Subtraction:         Sample 2 (Coding by Division:
    A = -840 g)                              M = 0.001 liters/ml)

    X_i (g)    [X_i] = X_i - 840 g           X_i (ml)       [X_i] = (X_i)(0.001 liters/ml)
                                                                  = [X_i] liters

    842        2                             8,000          8.000
    844        4                             9,000          9.000
    846        6                             [illegible]    [illegible]
    846        6                             11,000         11.000
    847        7                             [illegible]    [illegible]
    848        8                             13,000         13.000
    849        9

    Sum of X_i = 5922 g                      Sum of X_i = 63,000 ml
    Sum of [X_i] = 42 g                      Sum of [X_i] = 63.000 liters

    $\bar{X}$ = 5922 g / 7 = 846 g           $\bar{X}$ = 10,500 ml
    $[\bar{X}]$ = 42 g / 7 = 6 g             $[\bar{X}]$ = 10.500 liters

    $\bar{X} = [\bar{X}] - A$                $\bar{X} = [\bar{X}] / M$
      = 6 g - (-840 g)                         = 10.500 liters / (0.001 liters/ml)
      = 846 g                                  = 10,500 ml



In Sample 1 of Example 3.4, A = -840 g. The mean of a set of data thus coded is

$$[\bar{X}] = \bar{X} + A; \tag{3.16}$$

so if one has calculated $[\bar{X}]$ using coded data, it is a simple matter to determine what
the sample mean would have been if the data had not been coded, namely

$$\bar{X} = [\bar{X}] - A. \tag{3.17}$$

If one codes X by multiplying by a constant, M, then each coded datum is

$$[X_i] = M X_i. \tag{3.18}$$

In Sample 2 of Example 3.4, M = 1/1000 = 0.001 liters/ml. The mean of the coded
data is

$$[\bar{X}] = M\bar{X}. \tag{3.19}$$

Knowing $[\bar{X}]$, one can determine that the mean of the uncoded data is

$$\bar{X} = \frac{[\bar{X}]}{M}. \tag{3.20}$$

Coding affects the median and mode in the same way as the mean is affected.
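
The decoding steps of Equations 3.17 and 3.20 can be verified with the Sample 1 data and the Sample 2 coded mean from Example 3.4. This is a minimal sketch; the function names are hypothetical, and only values stated in the example are used.

```python
from statistics import mean, median


def decode_additive(coded_value, a):
    """Undo coding by addition of A (Equation 3.17)."""
    return coded_value - a


def decode_multiplicative(coded_value, m):
    """Undo coding by multiplication by M (Equation 3.20)."""
    return coded_value / m


# Sample 1: coding by subtraction, A = -840 g (Equation 3.15).
sample1_g = [842, 844, 846, 846, 847, 848, 849]
a = -840
coded1 = [x + a for x in sample1_g]

print(mean(coded1))                         # 6
print(decode_additive(mean(coded1), a))     # 846
# The median is decoded in the same way as the mean.
print(decode_additive(median(coded1), a) == median(sample1_g))  # True

# Sample 2: coded mean of 10.500 liters with M = 0.001 liters/ml.
m = 0.001
print(round(decode_multiplicative(10.5, m)))  # 10500
```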